openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

Libvirt IPI: Workers not getting created on RHEL 8.6+ with virsh 8.0.0 #7004

Closed pratham-m closed 10 months ago

pratham-m commented 1 year ago

Version

$ openshift-install version
openshift-install unreleased-master-7855-gf11c21e5e98c0f19516a0c0b13b8350a8f636b36-dirty
built from commit f11c21e5e98c0f19516a0c0b13b8350a8f636b36
release image registry.ci.openshift.org/origin/release:4.13
release architecture ppc64le

Platform: Libvirt IPI

$ virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0

$ lsb_release -a
LSB Version:    :core-4.1-noarch:core-4.1-ppc64le
Distributor ID: RedHatEnterprise
Description:    Red Hat Enterprise Linux release 8.7 (Ootpa)
Release:        8.7
Codename:       Ootpa

What happened?

Running openshift-install create cluster --dir=$CLUSTER_DIR --log-level=debug fails while waiting for the worker nodes to come up.

level=info msg=Waiting up to 40m0s (until 10:10AM) for the cluster at https://api.ppc64le-qe53c.psi.redhat.com:6443/ to initialize...
level=debug msg=Still waiting for the cluster to initialize: Multiple errors are preventing progress:
level=debug msg=* Cluster operators authentication, image-registry, ingress, insights, kube-apiserver, kube-controller-manager, kube-scheduler, machine-api, monitoring, openshift-apiserver, openshift-controller-manager, openshift-samples, operator-lifecycle-manager-packageserver are not available
level=debug msg=* Could not update imagestream "openshift/driver-toolkit" (581 of 840): the server is down or not responding
level=debug msg=* Could not update oauthclient "console" (524 of 840): the server does not recognize this resource, check extension API servers
level=debug msg=* Could not update role "openshift-console-operator/prometheus-k8s" (757 of 840): resource may have been deleted
level=debug msg=* Could not update role "openshift-console/prometheus-k8s" (760 of 840): resource may have been deleted
level=debug msg=Still waiting for the cluster to initialize: Multiple errors are preventing progress:
level=debug msg=* Cluster operators authentication, image-registry, ingress, insights, kube-apiserver, kube-controller-manager, kube-scheduler, machine-api, monitoring, openshift-apiserver, openshift-controller-manager, openshift-samples, operator-lifecycle-manager-packageserver are not available
level=debug msg=* Could not update imagestream "openshift/driver-toolkit" (581 of 840): the server is down or not responding
level=debug msg=* Could not update oauthclient "console" (524 of 840): the server does not recognize this resource, check extension API servers
level=debug msg=* Could not update role "openshift-console-operator/prometheus-k8s" (757 of 840): resource may have been deleted
level=debug msg=* Could not update role "openshift-console/prometheus-k8s" (760 of 840): resource may have been deleted
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.13.0-rc.0
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.13.0-rc.0: 581 of 840 done (69% complete)
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=Still waiting for the cluster to initialize: Multiple errors are preventing progress:
level=debug msg=* Cluster operators authentication, console, image-registry, ingress, kube-apiserver, machine-api, monitoring are not available
level=debug msg=* Could not update prometheusrule "openshift-cluster-version/cluster-version-operator" (11 of 840)
level=debug msg=* Could not update prometheusrule "openshift-etcd-operator/etcd-prometheus-rules" (769 of 840)
level=debug msg=* Could not update servicemonitor "openshift-console/console" (762 of 840)
level=debug msg=* Could not update servicemonitor "openshift-ingress-operator/ingress-operator" (773 of 840)
level=debug msg=* Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (809 of 840)
level=debug msg=* Could not update servicemonitor "openshift-service-ca-operator/service-ca-operator" (831 of 840)
level=debug msg=Still waiting for the cluster to initialize: Cluster operators authentication, console, image-registry, ingress, machine-api, monitoring are not available
...
...
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=error msg=failed to initialize the cluster: Cluster operators authentication, console, image-registry, ingress, machine-api, monitoring are not available

$ virsh list
 Id   Name                                 State
----------------------------------------------------
 15   ppc64le-qe53c-5pvdd-master-2         running
 16   ppc64le-qe53c-5pvdd-master-0         running
 17   ppc64le-qe53c-5pvdd-master-1         running
 19   ppc64le-qe53c-5pvdd-worker-0-7c6lw   running
 20   ppc64le-qe53c-5pvdd-worker-0-4vn7d   running
 21   ppc64le-qe53c-5pvdd-worker-0-kzwrk   running

$ oc get nodes
NAME                           STATUS   ROLES                  AGE   VERSION
ppc64le-qe53c-5pvdd-master-0   Ready    control-plane,master   44m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-1   Ready    control-plane,master   44m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-2   Ready    control-plane,master   44m   v1.26.2+06e8c46

$ oc get machines -A
NAMESPACE               NAME                                 PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ppc64le-qe53c-5pvdd-master-0         Running                               44m
openshift-machine-api   ppc64le-qe53c-5pvdd-master-1         Running                               44m
openshift-machine-api   ppc64le-qe53c-5pvdd-master-2         Running                               44m
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0-4vn7d   Provisioning                          41m
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0-7c6lw   Provisioning                          41m
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0-kzwrk   Provisioning                          41m

$ oc get machinesets -A
NAMESPACE               NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ppc64le-qe53c-5pvdd-worker-0   3         3                             44m
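
The state above (domains running in virsh, Machines stuck in Provisioning, no worker nodes) can be checked quickly with a small filter. This is a minimal sketch, not part of the installer: stuck_machines is a hypothetical helper name, and it assumes the PHASE column is populated for every row, as it is in the output above.

```shell
# Print the name and phase of every Machine that is not in the Running phase.
# Reads the table printed by `oc get machines -A` on stdin
# (columns: NAMESPACE NAME PHASE ...). Assumes PHASE is never empty.
stuck_machines() {
  awk 'NR > 1 && $3 != "Running" { print $2, $3 }'
}

# Usage against a live cluster:
#   oc get machines -A | stuck_machines
```

For the output shown above this would list the three worker-0 Machines with the phase Provisioning.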

What you expected to happen?

OCP cluster creation should succeed and all worker nodes should come up. The expected output is shown below:

$ oc get nodes
NAME                                 STATUS   ROLES    AGE    VERSION
ppc64le-qe53c-5pvdd-master-0         Ready    master   150m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-1         Ready    master   150m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-master-2         Ready    master   150m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-worker-0-4vn7d   Ready    worker   145m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-worker-0-7c6lw   Ready    worker   141m   v1.26.2+06e8c46
ppc64le-qe53c-5pvdd-worker-0-kzwrk   Ready    worker   145m   v1.26.2+06e8c46

How to reproduce it

Clone the repository and install the prerequisites as per https://github.com/openshift/installer/tree/master/docs/dev/libvirt#libvirt-howto

$ TAGS=libvirt DEFAULT_ARCH=ppc64le hack/build.sh
$ openshift-install --dir=$CLUSTER_DIR create install-config
$ openshift-install --dir=$CLUSTER_DIR create manifests
$ openshift-install --dir=$CLUSTER_DIR create cluster --log-level=debug

Anything else we need to know?

The issue is not specific to any OCP version and is reproducible on 4.12.x, 4.11.x, etc. The same steps work fine on RHEL 8.5 with virsh 6.0.0.
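
Since the only difference reported here is the host's libvirt version, it may help to record it in a script before starting an install. The sketch below parses the "Using library" line of virsh version output; libvirt_version and libvirt_reported_working are hypothetical helper names, and the check encodes only what this report states (6.0.0 works), not a verified compatibility matrix.

```shell
# Extract the libvirt library version from `virsh version` output on stdin,
# e.g. "Using library: libvirt 8.0.0" yields "8.0.0".
libvirt_version() {
  awk '/^Using library:/ { print $4 }'
}

# Exit 0 only for the 6.0.0 release that this report says works.
# Assumption: any other version is treated as "not confirmed working".
libvirt_reported_working() {
  [ "$(libvirt_version)" = "6.0.0" ]
}

# Usage on the hypervisor host:
#   virsh version | libvirt_reported_working || echo "libvirt version not confirmed working"
```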

References

The issues below might not be related, but they seem to share a few similarities.

pratham-m commented 1 year ago

$ cat install-config.yaml
apiVersion: v1
baseDomain: psi.redhat.com
compute:
- architecture: ppc64le
  hyperthreading: Enabled
  name: worker
  replicas: 3
controlPlane:
  architecture: ppc64le
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  creationTimestamp: null
  name: ppc64le-qe53c
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.128.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  libvirt:
    network:
      if: tt2
pullSecret: .........
sshKey: ............

pratham-m commented 1 year ago

$ oc --namespace=openshift-machine-api get deployments
NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler-operator          1/1     1            1           113d
cluster-baremetal-operator           1/1     1            1           113d
control-plane-machine-set-operator   1/1     1            1           113d
machine-api-controllers              1/1     1            1           113d
machine-api-operator                 1/1     1            1           113d

$ oc --namespace=openshift-machine-api logs deployments/machine-api-controllers --container=machine-controller
...
...
I0321 06:14:03.212356       1 controller.go:187] ppc64le-qe6a-mr77v-master-2: reconciling Machine
I0321 06:14:03.212429       1 actuator.go:224] Checking if machine ppc64le-qe6a-mr77v-master-2 exists.
I0321 06:14:03.214188       1 client.go:142] Created libvirt connection: 0xc000640e68
I0321 06:14:03.214738       1 client.go:317] Check if "ppc64le-qe6a-mr77v-master-2" domain exists
I0321 06:14:03.215200       1 client.go:158] Freeing the client pool
I0321 06:14:03.215357       1 client.go:164] Closing libvirt connection: 0xc000640e68
I0321 06:14:03.215807       1 controller.go:313] ppc64le-qe6a-mr77v-master-2: reconciling machine triggers idempotent update
I0321 06:14:03.215858       1 actuator.go:189] Updating machine ppc64le-qe6a-mr77v-master-2
I0321 06:14:03.218036       1 client.go:142] Created libvirt connection: 0xc000641158
I0321 06:14:03.218356       1 client.go:302] Lookup domain by name: "ppc64le-qe6a-mr77v-master-2"
I0321 06:14:03.218688       1 actuator.go:364] Updating status for ppc64le-qe6a-mr77v-master-2
I0321 06:14:03.220932       1 client.go:158] Freeing the client pool
I0321 06:14:03.221011       1 client.go:164] Closing libvirt connection: 0xc000641158
I0321 06:14:03.229977       1 controller.go:187] ppc64le-qe6a-mr77v-worker-0-65rjs: reconciling Machine
I0321 06:14:03.230057       1 actuator.go:224] Checking if machine ppc64le-qe6a-mr77v-worker-0-65rjs exists.
I0321 06:14:03.232083       1 client.go:142] Created libvirt connection: 0xc000818e18
I0321 06:14:03.232471       1 client.go:317] Check if "ppc64le-qe6a-mr77v-worker-0-65rjs" domain exists
I0321 06:14:03.232807       1 client.go:158] Freeing the client pool
I0321 06:14:03.232853       1 client.go:164] Closing libvirt connection: 0xc000818e18
I0321 06:14:03.233232       1 controller.go:313] ppc64le-qe6a-mr77v-worker-0-65rjs: reconciling machine triggers idempotent update
I0321 06:14:03.233271       1 actuator.go:189] Updating machine ppc64le-qe6a-mr77v-worker-0-65rjs
I0321 06:14:03.234835       1 client.go:142] Created libvirt connection: 0xc0008190d8
I0321 06:14:03.235255       1 client.go:302] Lookup domain by name: "ppc64le-qe6a-mr77v-worker-0-65rjs"
I0321 06:14:03.235568       1 actuator.go:364] Updating status for ppc64le-qe6a-mr77v-worker-0-65rjs
I0321 06:14:03.237846       1 client.go:158] Freeing the client pool
I0321 06:14:03.237968       1 client.go:164] Closing libvirt connection: 0xc0008190d8
I0321 06:14:03.248101       1 controller.go:187] ppc64le-qe6a-mr77v-worker-0-h7b2s: reconciling Machine
I0321 06:14:03.248126       1 actuator.go:224] Checking if machine ppc64le-qe6a-mr77v-worker-0-h7b2s exists.
I0321 06:14:03.250372       1 client.go:142] Created libvirt connection: 0xc000c54988
I0321 06:14:03.250726       1 client.go:317] Check if "ppc64le-qe6a-mr77v-worker-0-h7b2s" domain exists
I0321 06:14:03.251060       1 client.go:158] Freeing the client pool
I0321 06:14:03.251087       1 client.go:164] Closing libvirt connection: 0xc000c54988
I0321 06:14:03.251454       1 controller.go:313] ppc64le-qe6a-mr77v-worker-0-h7b2s: reconciling machine triggers idempotent update
I0321 06:14:03.251466       1 actuator.go:189] Updating machine ppc64le-qe6a-mr77v-worker-0-h7b2s
I0321 06:14:03.253286       1 client.go:142] Created libvirt connection: 0xc000c54c48
I0321 06:14:03.253599       1 client.go:302] Lookup domain by name: "ppc64le-qe6a-mr77v-worker-0-h7b2s"
I0321 06:14:03.253933       1 actuator.go:364] Updating status for ppc64le-qe6a-mr77v-worker-0-h7b2s
I0321 06:14:03.256177       1 client.go:158] Freeing the client pool
I0321 06:14:03.256196       1 client.go:164] Closing libvirt connection: 0xc000c54c48
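
The machine-controller log is long and mostly informational, so filtering by severity can make problems easier to spot. The sketch below relies on the klog line format, where each line starts with a severity letter (I, W, E or F) followed by the month and day; klog_problems is a hypothetical helper name.

```shell
# Keep only warning (W), error (E) and fatal (F) klog lines from stdin.
# klog prefixes each line with the severity letter and a four-digit mmdd date.
# `|| true` keeps the exit status 0 when grep finds no matching line.
klog_problems() {
  grep -E '^[WEF][0-9]{4} ' || true
}

# Usage against a live cluster:
#   oc --namespace=openshift-machine-api logs deployments/machine-api-controllers \
#     --container=machine-controller | klog_problems
```

Every line in the excerpt above starts with I, so the filter would print nothing for it.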

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

pratham-m commented 1 year ago

/remove-lifecycle stale

openshift-bot commented 1 year ago

/lifecycle stale

cfergeau commented 1 year ago

Since this happens on ppc64le, this is most likely https://issues.redhat.com/browse/OCPBUGS-17476, which is caused by a regression in SLOF. While this is being fixed, we can add a workaround in cluster-api-provider-libvirt.

dale-fu commented 1 year ago

We have also seen this on s390x before. We are stuck using a specific version of libvirt, libvirt-6.0.0-37.module+el8.5.0+12162+40884dd2, since anything later didn't seem to work for libvirt IPI installation.

pratham-m commented 1 year ago

/remove-lifecycle stale

pratham-m commented 1 year ago

Related to https://github.com/openshift/cluster-api-provider-libvirt/pull/263

openshift-bot commented 11 months ago

/lifecycle stale