openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

libvirt: Unable to access web console #1007

Closed: rhopp closed this issue 3 years ago

rhopp commented 5 years ago

Version

$ openshift-install version
v0.9.0-master

(compiled from master)

Platform (aws|libvirt|openstack):

libvirt

What happened?

I'm trying to install OpenShift 4 using this installer, and everything seemed to go fine. I followed all the steps described here. The installation succeeded and I was able to log in with oc using the credentials from the installation output, but I'm not able to access the web console.

Looking at the openshift-console project, everything seems OK:

OUTPUT

```
╭─rhopp@dhcp-10-40-4-106 ~/go/src/github.com/openshift/installer ‹master*›
╰─$ oc project openshift-console
Already on project "openshift-console" on server "https://test1-api.tt.testing:6443".
╭─rhopp@dhcp-10-40-4-106 ~/go/src/github.com/openshift/installer ‹master*›
╰─$ oc get all
NAME                                     READY   STATUS    RESTARTS   AGE
pod/console-operator-79b8b8cb8d-cgpfn    1/1     Running   1          1h
pod/openshift-console-6ddfcc76b5-2kmpx   1/1     Running   0          1h
pod/openshift-console-6ddfcc76b5-sp5zm   1/1     Running   0          1h
pod/openshift-console-6ddfcc76b5-z52hq   1/1     Running   0          1h

NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/console   ClusterIP   172.30.198.57                 443/TCP   1h

NAME                                DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/console-operator    1         1         1            1           1h
deployment.apps/openshift-console   3         3         3            3           1h

NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/console-operator-79b8b8cb8d    1         1         1       1h
replicaset.apps/openshift-console-6ddfcc76b5   3         3         3       1h

NAME                               HOST/PORT                                          PATH   SERVICES   PORT    TERMINATION          WILDCARD
route.route.openshift.io/console   console-openshift-console.apps.test1.tt.testing          console    https   reencrypt/Redirect   None
```

The pods are running and the service and route are up, but opening https://console-openshift-console.apps.test1.tt.testing in a browser fails with a message that the IP address couldn't be resolved.

As part of the setup I configured dnsmasq as described in the libvirt guide. For example, ping test1-api.tt.testing works as expected, but ping console-openshift-console.apps.test1.tt.testing fails with:

ping: console-openshift-console.apps.test1.tt.testing: Name or service not known
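
The workaround discussed later in this thread is a wildcard dnsmasq entry on the host; a minimal sketch, assuming NetworkManager's dnsmasq and the default 192.168.126.51 worker/ingress address from the libvirt howto:

```
# /etc/NetworkManager/dnsmasq.d/openshift.conf (sketch)
server=/tt.testing/192.168.126.1
address=/.apps.test1.tt.testing/192.168.126.51
```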

What you expected to happen?

Web console to be accessible.

How to reproduce it (as minimally and precisely as possible)?

Follow https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md (my host machine is Fedora 29)

INSTALLATION OUTPUT

```
╭─rhopp@localhost ~/go/src/github.com/openshift/installer/bin ‹master*›
╰─$ ./openshift-install create cluster
? SSH Public Key /home/rhopp/.ssh/gitlab.cee.key.pub
? Platform libvirt
? Libvirt Connection URI qemu+tcp://192.168.122.1/system
? Base Domain tt.testing
? Cluster Name test1
? Pull Secret [? for help] *****************************
INFO Fetching OS image: redhat-coreos-maipo-47.247-qemu.qcow2.gz
INFO Creating cluster...
INFO Waiting up to 30m0s for the Kubernetes API...
INFO API v1.11.0+e3fa228 up
INFO Waiting up to 30m0s for the bootstrap-complete event...
INFO Destroying the bootstrap resources...
INFO Waiting up to 10m0s for the openshift-console route to be created...
INFO Install complete!
INFO Run 'export KUBECONFIG=/home/rhopp/go/src/github.com/openshift/installer/bin/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI.
INFO The cluster is ready when 'oc login -u kubeadmin -p 5tQwM-fXfkC-MIeAH-BmLeN' succeeds (wait a few minutes).
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.test1.tt.testing
INFO Login to the console with user: kubeadmin, password: 5tQwM-fXfkC-MIeAH-BmLeN
```
crawford commented 5 years ago

Duplicate of https://github.com/openshift/installer/issues/411.

openshift-ci-robot commented 5 years ago

@crawford: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/1007#issuecomment-452040286): >Duplicate of https://github.com/openshift/installer/issues/411. > Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
wking commented 5 years ago

#411 was closed, since AWS works. Reopening for libvirt.

wking commented 5 years ago

Docs in flight with #1371

ghost commented 5 years ago

Hi,

Is this working? Does #1371 resolve all wildcard names?

Best Regards, Fábio Sbano

zeenix commented 5 years ago

90b0d45 only documents a workaround, unfortunately.

/reopen

openshift-ci-robot commented 5 years ago

@zeenix: Reopened this issue.

In response to [this](https://github.com/openshift/installer/issues/1007#issuecomment-494868514): >90b0d45 only documents a workaround, unfortunately. > >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
sgreene570 commented 5 years ago

Has anyone had luck with the workaround posted in 90b0d45 recently? My libvirt cluster does not bring up the console operator with or without the documented workaround.

sgreene570 commented 5 years ago

I tried setting the OAuth hostname statically, without wildcards, in my dnsmasq config, and I'm still getting OAuth errors from the console. See below.

dnsmasq config

~$ cat /etc/NetworkManager/dnsmasq.d/openshift.conf 
server=/tt.testing/192.168.126.1
address=/.apps.tt.testing/192.168.126.51
address=/oauth-openshift.apps.test1.tt.testing/192.168.126.51

Sanity check that the hostname resolves to the proper node IP:

~$ ping oauth-openshift.apps.test1.tt.testing
PING oauth-openshift.apps.test1.tt.testing (192.168.126.51) 56(84) bytes of data.
64 bytes from 192.168.126.51 (192.168.126.51): icmp_seq=1 ttl=64 time=0.114 ms
64 bytes from 192.168.126.51 (192.168.126.51): icmp_seq=2 ttl=64 time=0.136 ms

Output of the crashing openshift-console pod's logs:

~$ oc logs -f console-67dbf7f789-k4gqg  
2019/05/30 22:51:45 cmd/main: cookies are secure!
2019/05/30 22:51:45 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.test1.tt.testing/oauth/token failed: Head https://oauth-openshift.apps.test1.tt.testing: dial tcp: lookup oauth-openshift.apps.test1.tt.testing on 172.30.0.10:53: no such host

Am I missing something?

zeenix commented 5 years ago

> Has anyone had luck with the workaround posted in 90b0d45 recently?

I just did, and apart from the usual timeout issue, the cluster came up fine as far as I can tell.

zeenix commented 5 years ago

/priority important-longterm

zeenix commented 5 years ago

@cfergeau You said you had a WIP patch to fix this on libvirt level. Do you think you'd be able to get that in, in the near future?

/assign @cfergeau

openshift-ci-robot commented 5 years ago

@zeenix: GitHub didn't allow me to assign the following users: cfergeau.

Note that only openshift members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/openshift/installer/issues/1007#issuecomment-506773726): >@cfergeau You said you had a WIP patch to fix this on libvirt level. Do you think you'd be able to get that in, in the near future? > >/assign @cfergeau Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
TuranTimur commented 5 years ago

Hi. I did the same, but the error still persists. Do I need to debug the installer, or is there any other pointer?

tail -f setup/.openshift_install.log
time="2019-08-10T04:47:10+08:00" level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (417 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (382 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (6 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (421 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-image-registry/image-registry\" (388 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (398 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (402 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (406 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (144 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (408 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (411 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (391 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (394 of 422): the server does not recognize this resource, check extension API servers"
time="2019-08-10T04:54:14+08:00" level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (417 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (382 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (6 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (421 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-image-registry/image-registry\" (388 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (398 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (402 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (406 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (144 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (408 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (411 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (391 of 422): the server does not recognize this resource, check extension API servers\n Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (394 of 422): the server does not recognize this resource, check extension API servers"
time="2019-08-10T04:56:51+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209"
time="2019-08-10T04:56:51+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: downloading update"
time="2019-08-10T04:56:56+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209"
time="2019-08-10T04:57:11+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 19% complete"
time="2019-08-10T04:57:22+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 82% complete"
time="2019-08-10T04:57:38+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 95% complete"
time="2019-08-10T05:00:27+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 95% complete"
time="2019-08-10T05:01:40+08:00" level=fatal msg="failed to initialize the cluster: Working towards 4.2.0-0.okd-2019-08-09-191209: 95% complete"

zeenix commented 5 years ago

@donghwicha Your issue is unrelated to this one.

TuranTimur commented 5 years ago

Thanks. I already fixed it.

deanpeterson commented 5 years ago

> Has anyone had luck with the workaround posted in 90b0d45 recently?

> I just did, and apart from the usual timeout issue, the cluster came up fine as far as I can tell.

I increased my timeouts to 90 minutes but still no luck even after applying this "workaround".
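
An alternative to patching the timeouts in the installer source is to resume the wait after the installer gives up; a sketch, assuming an installer new enough to have the wait-for subcommand and the same asset directory that was used for the install:

```
# resume waiting for cluster initialization without re-creating anything
./openshift-install wait-for install-complete --dir=. --log-level=debug
```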

deanpeterson commented 5 years ago

I was finally successful. I made a video to help anyone else having a tough time getting through the install process: https://youtu.be/4mFMqNExRWk

zeenix commented 4 years ago

To fix this, we probably want to make use of the new libvirt mechanism for passing verbatim options to dnsmasq, but to be able to do that, we need terraform support.

zeenix commented 4 years ago

Update: it turns out we can make use of the existing XSLT feature of the terraform libvirt provider for this.

jichenjc commented 4 years ago

@zeenix I saw the issue was closed on the terraform side, so should we add some template to the installer here, or some other settings?

zeenix commented 4 years ago

@jichenjc I was looking into this last week but w/o success yet. I've also heard that someone is working on this on the ingress operator level so I'll hold off my efforts for now.

ghost commented 4 years ago

Hi,

All my services are running ....

https://twitter.com/fabiosbano/status/1175842429641080832?s=09

Best Regards, Fabio Sbano

jichenjc commented 4 years ago

Thanks, @ssbano. I saw the picture; what kind of changes made that happen? Thanks a lot.

ghost commented 4 years ago

@jichenjc

I will write up the steps I performed.

Best Regards, Fabio Sbano

ghost commented 4 years ago

@jichenjc

You can set up DNS (bind, on bare metal) to resolve *.apps.${domain}; I made the changes below.

1)

[root@argon ~]# cat /etc/NetworkManager/dnsmasq.d/openshift.conf 
server=/jaguar.fsbano.com/192.168.126.1
server=/apps.jaguar.fsbano.com/172.27.15.30
[root@argon ~]# 

2) git diff

[root@argon installer]# git diff
diff --git a/cmd/openshift-install/create.go b/cmd/openshift-install/create.go
index 9021025b6..679649d1d 100644
--- a/cmd/openshift-install/create.go
+++ b/cmd/openshift-install/create.go
@@ -238,7 +238,7 @@ func waitForBootstrapComplete(ctx context.Context, config *rest.Config, director

        discovery := client.Discovery()

-       apiTimeout := 30 * time.Minute
+       apiTimeout := 60 * time.Minute
        logrus.Infof("Waiting up to %v for the Kubernetes API at %s...", apiTimeout, config.Host)
        apiContext, cancel := context.WithTimeout(ctx, apiTimeout)
        defer cancel()
@@ -279,7 +279,7 @@ func waitForBootstrapComplete(ctx context.Context, config *rest.Config, director
 // and waits for the bootstrap configmap to report that bootstrapping has
 // completed.
 func waitForBootstrapConfigMap(ctx context.Context, client *kubernetes.Clientset) error {
-       timeout := 30 * time.Minute
+       timeout := 60 * time.Minute
        logrus.Infof("Waiting up to %v for bootstrapping to complete...", timeout)

        waitCtx, cancel := context.WithTimeout(ctx, timeout)
@@ -317,7 +317,7 @@ func waitForBootstrapConfigMap(ctx context.Context, client *kubernetes.Clientset
 // waitForInitializedCluster watches the ClusterVersion waiting for confirmation
 // that the cluster has been initialized.
 func waitForInitializedCluster(ctx context.Context, config *rest.Config) error {
-       timeout := 30 * time.Minute
+       timeout := 60 * time.Minute
        logrus.Infof("Waiting up to %v for the cluster at %s to initialize...", timeout, config.Host)
        cc, err := configclient.NewForConfig(config)
        if err != nil {
diff --git a/data/data/libvirt/main.tf b/data/data/libvirt/main.tf
index 9ba88c9cf..152c78dd5 100644
--- a/data/data/libvirt/main.tf
+++ b/data/data/libvirt/main.tf
@@ -54,6 +54,11 @@ resource "libvirt_network" "net" {
   dns {
     local_only = true

+    forwarders { 
+        address = "172.27.15.30"
+        domain = "apps.${var.cluster_domain}"
+    }
+
     dynamic "srvs" {
       for_each = data.libvirt_network_dns_srv_template.etcd_cluster.*.rendered
       content {
diff --git a/data/data/libvirt/variables-libvirt.tf b/data/data/libvirt/variables-libvirt.tf
index 53cf68bae..79d1018e2 100644
--- a/data/data/libvirt/variables-libvirt.tf
+++ b/data/data/libvirt/variables-libvirt.tf
@@ -32,7 +32,7 @@ variable "libvirt_master_ips" {
 variable "libvirt_master_memory" {
   type        = string
   description = "RAM in MiB allocated to masters"
-  default     = "6144"
+  default     = "16384"
 }

 # At some point this one is likely to default to the number
diff --git a/pkg/asset/machines/libvirt/machines.go b/pkg/asset/machines/libvirt/machines.go
index 2ab6d9aa2..08847ab95 100644
--- a/pkg/asset/machines/libvirt/machines.go
+++ b/pkg/asset/machines/libvirt/machines.go
@@ -63,7 +63,7 @@ func provider(clusterID string, networkInterfaceAddress string, platform *libvir
                        APIVersion: "libvirtproviderconfig.openshift.io/v1beta1",
                        Kind:       "LibvirtMachineProviderConfig",
                },
-               DomainMemory: 7168,
+               DomainMemory: 16384,
                DomainVcpu:   4,
                Ignition: &libvirtprovider.Ignition{
                        UserDataSecret: userDataSecret,
[root@argon installer]# 
jichenjc commented 4 years ago

@ssbano thanks a lot !

I actually tried the /etc/NetworkManager/dnsmasq.d/openshift.conf change, and that seems to work for me (at least the console starts up).
Can I ask the purpose of the following lines? Thanks

+    forwarders { 
+        address = "172.27.15.30"
+        domain = "apps.${var.cluster_domain}"
+    }
+
ghost commented 4 years ago

@jichenjc

I am using named for wildcard name resolution instead of dnsmasq

The IP address '172.27.15.30' is the physical machine running my bind service.

Best regards, Fábio Sbano

jichenjc commented 4 years ago

ok, thanks for the info ~

oswee commented 4 years ago

Similar issue signature here on 4.2. Interestingly, the exact same configs (I am using Ansible to set it up) worked only the first time and now constantly fail at almost the final stage: authentication is degraded. I spent the whole day trying to find out what could cause that. In my Bind ocp.example.com.zone I have *.apps IN A 192.168.1.254, where .254 is the HAProxy LB with server infnod-0 infnod-0.ocp.example.com:443 check. So basically *.apps.ocp.example.com points to the source-balanced infra nodes.

frontend ocp-kubernetes-api-server
    mode tcp
    option tcplog
    bind api.ocp.example.com:6443
    default_backend ocp-kubernetes-api-server

backend ocp-kubernetes-api-server
    balance source
    mode tcp
    server boostrap-0 bootstrap-0.ocp.example.com:6443 check
    server master-0 master-0.ocp.example.com:6443 check
    server master-1 master-1.ocp.example.com:6443 check
    server master-2 master-2.ocp.example.com:6443 check

frontend ocp-machine-config-server
    bind api.ocp.example.com:22623
    default_backend ocp-machine-config-server
    mode tcp
    option tcplog

backend ocp-machine-config-server
    balance source
    mode tcp
    server bootstrap-0 bootstrap-0.ocp.example.com:22623 check
    server master-0 master-0.ocp.example.com:22623 check
    server master-1 master-1.ocp.example.com:22623 check
    server master-2 master-2.ocp.example.com:22623 check

frontend ocp-router-http
    bind apps.ocp.example.com:80
    default_backend ocp-router-http
    mode tcp
    option tcplog

backend ocp-router-http
    balance source
    mode tcp
    server infnod-0 infnod-0.ocp.example.com:80 check
    server infnod-1 infnod-1.ocp.example.com:80 check

frontend ocp-router-https
    bind apps.ocp.example.com:443
    default_backend ocp-router-https
    mode tcp
    option tcplog

backend ocp-router-https
    balance source
    mode tcp
    server infnod-0 infnod-0.ocp.example.com:443 check
    server infnod-1 infnod-1.ocp.example.com:443 check

It doesn't matter whether I disable the bootstrap rules after bootstrapping is done.

E1027 16:04:32.356766       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: failed handling the route: route is not available at canonical host oauth-openshift.apps.ocp.example.com: []

If I ssh to core@master-0.ocp.example.com and ping/dig oauth-openshift.apps.ocp.example.com, I get the IP of the LB node (.254).

[screenshot] I don't know whether the infra nodes should be in this state at this point.

[screenshot]

Before all this, I had an issue with SELinux on my LB machine because I was missing:

semanage port  -a 22623 -t http_port_t -p tcp
semanage port  -a 6443 -t http_port_t -p tcp
openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 4 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

clnperez commented 4 years ago

Is there a way for @openshift-bot to give this immunity from becoming stale?

crawford commented 4 years ago

/remove-lifecycle rotten

This is still something we want to fix. There is just a surprisingly large number of pieces that need to fall into place in the background before we can tackle this.

ralvares commented 4 years ago

This might be a stupid question, but I'll ask anyway.

What is the reason for the option local_only = true, when local_only = false would fix this issue?

local_only - (Optional) true/false: true means 'do not forward unresolved requests for this domain to the parent DNS server'.

I ran the following test:

sed -i 's/local_only = true/local_only = false/' /root/go/src/github.com/openshift/installer/data/data/libvirt/main.tf

TAGS=libvirt hack/build.sh
mkdir /root/bin
cp -rf /root/go/src/github.com/openshift/installer/bin/openshift-install /root/bin/

yum install dnsmasq

echo -e "[main]\ndns=dnsmasq" | sudo tee /etc/NetworkManager/conf.d/openshift.conf

echo listen-address=127.0.0.1 > /etc/NetworkManager/dnsmasq.d/openshift.conf
echo bind-interfaces >> /etc/NetworkManager/dnsmasq.d/openshift.conf
echo server=8.8.8.8 >> /etc/NetworkManager/dnsmasq.d/openshift.conf
echo address=/apps.ocp.openshift.local/192.168.126.1 >> /etc/NetworkManager/dnsmasq.d/openshift.conf

systemctl reload NetworkManager

3x master 3x workers

and using a containerized load balancer:

/usr/bin/podman run -d --name loadbalancer --net host \
  -e API="bootstrap=192.168.126.10:6443,master-0=192.168.126.11:6443,master-1=192.168.126.12:6443,master-2=192.168.126.13:6443" \
  -e API_LISTEN="0.0.0.0:6443" \
  -e INGRESS_HTTP="worker-0=192.168.126.51:80,worker-1=192.168.126.52:80,worker-2=192.168.126.53:80" \
  -e INGRESS_HTTP_LISTEN="0.0.0.0:80" \
  -e INGRESS_HTTPS="worker-0=192.168.126.51:443,worker-1=192.168.126.52:443,worker-2=192.168.126.53:443" \
  -e INGRESS_HTTPS_LISTEN="0.0.0.0:443" \
  -e MACHINE_CONFIG_SERVER="bootstrap=192.168.126.10:22623,master-0=192.168.126.10:22623,master-1=192.168.126.11:22623,master-2=192.168.126.12:22623" \
  -e MACHINE_CONFIG_SERVER_LISTEN="127.0.0.1:22623" \
  quay.io/redhat-emea-ssa-team/openshift-4-loadbalancer

And the installation went well.

luisarizmendi commented 4 years ago

I used to solve this by changing the apps URL to a different subdomain, but since I want to use the default apps URL, I've also solved it by modifying data/data/libvirt/main.tf. Instead of changing local_only, I added a forwarders entry, just for the apps.$clustername.$basedomain domains, that points to the libvirt network gateway (the KVM host), where dnsmasq managed by NetworkManager resolves them:

dns {
  local_only = true

  forwarders {
    address = "192.168.122.1"
    domain  = "apps.$clustername.$basedomain"
  }
}

This is the KVM host dnsmasq config:

server=/$basedomain/192.168.126.1
address=/.apps.$clustername.$basedomain/192.168.126.1

Doing this, libvirt's dnsmasq keeps handling everything except the apps URLs (which it wouldn't resolve because of this issue); those are forwarded to the KVM host dnsmasq, which does resolve them.

You can check my playbook that configure the kvm here: https://github.com/luisarizmendi/ocp-libvirt-ipi-role/blob/master/tasks/kvm_deploy.yml

And the playbook that change the data/data/libvirt/main.tf file here: https://github.com/luisarizmendi/ocp-libvirt-ipi-role/blob/master/tasks/ocp_deploy.yml

cfergeau commented 4 years ago

https://gitlab.com/libvirt/libvirt/-/commit/fb9f6ce625322d10b2e2a7c3ce4faab780b97e8d might be a way to add the needed options to the libvirt dnsmasq instance, which would allow all the cluster-related name resolution to happen on 192.168.126.1 rather than having to go through a second dnsmasq instance managed by NetworkManager.
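
The dnsmasq XML namespace pass-through that newer libvirt supports could carry the wildcard entry directly in the network definition; a minimal sketch, assuming libvirt 5.6+ and this issue's example cluster domain and ingress address:

```xml
<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  <!-- existing name/bridge/ip/dns elements unchanged -->
  <dnsmasq:options>
    <!-- let the cluster network's own dnsmasq answer the wildcard apps records -->
    <dnsmasq:option value='address=/.apps.test1.tt.testing/192.168.126.51'/>
  </dnsmasq:options>
</network>
```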

ralvares commented 4 years ago

@cfergeau totally agreed. I did some tests on RHEL/CentOS 8 with libvirt 5.6, where libvirt manages all the DNS entries, including *.apps.

https://github.com/RedHat-EMEA-SSA-Team/labs/tree/master/disk-encryption#creating-libvirt-network

Best Regards

samuelvl commented 4 years ago

To make the feature proposed by @ralvares work when using the Terraform provider for libvirt, the following XSLT transformation can be applied: https://github.com/samuelvl/ocp4-disconnected-lab/blob/master/src/dns/libvirt-dns.xml.tpl

resource "libvirt_network" "openshift" {
  ...
  xml {
    xslt = data.template_file.openshift_libvirt_dns.rendered
  }
}
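
For illustration only (the linked template above is the authoritative version), an identity-transform stylesheet that appends a dnsmasq options block to the generated network XML could look roughly like the sketch below; the domain, address, and injected option are assumptions based on this issue's example values:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:dnsmasq="http://libvirt.org/schemas/network/dnsmasq/1.0">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- identity transform: copy the network definition unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- append a dnsmasq options pass-through to the <network> element -->
  <xsl:template match="/network">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <dnsmasq:options>
        <dnsmasq:option value="address=/.apps.test1.tt.testing/192.168.126.51"/>
      </dnsmasq:options>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```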
rthallisey commented 4 years ago

Here's a workaround: https://github.com/openshift/installer/issues/1648#issuecomment-585235423

clnperez commented 4 years ago

Last week, while trying to do some basic verification, I ran into an issue where the workaround listed in the installer troubleshooting doc wasn't working. We figured out it was because I had spun up a cluster with three workers, but the ingress controller has 2 set in its replica set, so neither of those pods landed on the .51 worker, and we saw the same symptoms as if no workaround had been applied. It doesn't look like there's a way to use wildcards and have multiple IPs for a host entry; dnsmasq seems to take the last entry in a file as the IP instead of doing any kind of round-robin. Any suggestions? Or do we just need to edit the manifest for the ingress operator to create 3 replicas?

marshallford commented 3 years ago

@clnperez I'm running into the same issue. Did you manage to find a solution?

clnperez commented 3 years ago

@marshallford no, nothing other than spinning up that 3rd replica for the ingress.
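
For anyone hitting the same thing, bumping the default ingress controller to one replica per worker can be done with a patch like the one below (a sketch, assuming the default IngressController name and a three-worker cluster):

```
oc patch ingresscontroller/default -n openshift-ingress-operator \
  --type=merge -p '{"spec":{"replicas":3}}'
```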

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/1007#issuecomment-809085885): >Rotten issues close after 30d of inactivity. > >Reopen the issue by commenting `/reopen`. >Mark the issue as fresh by commenting `/remove-lifecycle rotten`. >Exclude this issue from closing again by commenting `/lifecycle frozen`. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.