okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

[4.10] Baremetal IPI: install stuck on baremetal operator "Applying metal3 resources" #1226

Closed: valumar closed this issue 3 months ago

valumar commented 2 years ago

Describe the bug

There was an error during a baremetal IPI install: the metal3 deployment in the openshift-machine-api namespace could not be started properly for some reason. Version: 4.10.0-0.okd-2022-05-07-021833

Baremetal IPI

How reproducible

100%

Log bundle

Please see ironic-rhcos-downloader issue: https://github.com/openshift/ironic-rhcos-downloader/issues/74

oc get co

NAME                                       VERSION                          AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
baremetal                                  4.10.0-0.okd-2022-05-07-021833   True        True          False      2d      Applying metal3 resources
ingress                                    4.10.0-0.okd-2022-05-07-021833   True        False         True       47h     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
machine-config                                                              True        True          True       47h     Unable to apply 4.10.0-0.okd-2022-05-07-021833: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
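
To see what the baremetal operator is actually waiting on, the ClusterOperator conditions and the metal3 deployment can be inspected directly (a diagnostic sketch; the deployment name is inferred from the pod listing further down):

oc describe co baremetal
oc -n openshift-machine-api describe deployment metal3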

oc get mcp

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       3              0                   0                     3                      2d
worker   rendered-worker-464f1f9d705ed586f7444f98ed11e536   True      False      False      0              0                   0                     0                      2d

oc get pod -n openshift-machine-api

NAME                                           READY   STATUS                  RESTARTS          AGE
cluster-autoscaler-operator-7d464b5849-ljr4w   2/2     Running                 2 (2d ago)        2d
cluster-baremetal-operator-d5c54dfc4-hz656     2/2     Running                 0                 2d
machine-api-controllers-7f94785cbf-t5fg7       7/7     Running                 4 (2d ago)        2d
machine-api-operator-864797b4dc-vpwwn          2/2     Running                 1 (2d ago)        2d
metal3-f4f44f6f7-8v68z                         0/7     Init:CrashLoopBackOff   513 (2m4s ago)    47h
metal3-image-cache-4h4x5                       0/1     Init:1/2                271 (7m58s ago)   47h
metal3-image-cache-cvffn                       0/1     Init:1/2                271 (7m54s ago)   47h
metal3-image-cache-t77wk                       0/1     Init:1/2                272 (15s ago)     47h
metal3-image-customization-795c979fcb-2g6q2    1/1     Running                 0                 47h
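
For reference, the init-container log below can be pulled from the crash-looping metal3 pod with oc logs (the pod name is taken from the listing above and will differ per cluster):

oc -n openshift-machine-api logs metal3-f4f44f6f7-8v68z -c metal3-machine-os-downloader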

Log output of the metal3-machine-os-downloader init container in the metal3 pod:

+ export http_proxy=
+ http_proxy=
+ export https_proxy=
+ https_proxy=
+ export no_proxy=
+ no_proxy=
+ export CURL_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
+ CURL_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
+ export IP_OPTIONS=ip=dhcp
+ IP_OPTIONS=ip=dhcp
+ export 'RHCOS_IMAGE_URL=http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz?sha256=5e3e40723288aa56735caa5e5fcb58079da5b27df392ee185a09f9b6742fa93f'
+ RHCOS_IMAGE_URL='http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz?sha256=5e3e40723288aa56735caa5e5fcb58079da5b27df392ee185a09f9b6742fa93f'
+ '[' -z 'http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz?sha256=5e3e40723288aa56735caa5e5fcb58079da5b27df392ee185a09f9b6742fa93f' ']'
++ echo 'http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz?sha256=5e3e40723288aa56735caa5e5fcb58079da5b27df392ee185a09f9b6742fa93f'
++ cut -f 1 -d '?'
+ RHCOS_IMAGE_URL_STRIPPED=http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz
+ [[ http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz =~ qcow2(\.[gx]z)?$ ]]
++ basename http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz
+ RHCOS_IMAGE_FILENAME_RAW=fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz
+ RHCOS_IMAGE_FILENAME_QCOW=fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2
+ IMAGE_FILENAME_EXTENSION=.xz
++ dirname http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz
+ IMAGE_URL=http://10.xxx.ccc.3:8080
+ RHCOS_IMAGE_FILENAME_COMPRESSED=fedora-coreos-35.20220327.3.0-compressed.x86_64.qcow2
+ RHCOS_IMAGE_FILENAME_CACHED=cached-fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2
+ FFILENAME=rhcos-ootpa-latest.qcow2
+ mkdir -p /shared/html/images
+ [[ fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2 == *\-\o\p\e\n\s\t\a\c\k* ]]
+ [[ -s /shared/html/images/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2/fedora-coreos-35.20220327.3.0-compressed.x86_64.qcow2.md5sum ]]
+ mkdir -p /shared/tmp
++ mktemp -d -p /shared/tmp
+ TMPDIR=/shared/tmp/tmp.kZQHYAz4Js
+ trap 'rm -fr /shared/tmp/tmp.kZQHYAz4Js' EXIT
+ cd /shared/tmp/tmp.kZQHYAz4Js
+ clearproxy http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz
+ unset HTTP_PROXY http_proxy HTTPS_PROXY https_proxy
+ '[' -s /shared/html/images/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2/cached-fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.md5sum ']'
+ CONNECT_TIMEOUT=120
+ MAX_ATTEMPTS=5
++ seq 5
+ for i in $(seq ${MAX_ATTEMPTS})
+ curl -v -g --compressed -L --fail --connect-timeout 120 -o fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz http://10.xxx.ccc.3:8080/fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz
*   Trying 10.xxx.ccc.3...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 10.xxx.ccc.3 (10.xxx.ccc.3) port 8080 (#0)
> GET /fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz HTTP/1.1
> Host: 10.xxx.ccc.3:8080
> User-Agent: curl/7.61.1
> Accept: */*
> Accept-Encoding: deflate, gzip, br
> 
< HTTP/1.1 200 OK
< Date: Mon, 16 May 2022 12:02:41 GMT
< Server: Apache/2.4.34 (Red Hat) OpenSSL/1.0.2k-fips
< Last-Modified: Thu, 12 May 2022 08:33:51 GMT
< ETag: "29280ec0-5decc6adfbbd4"
< Accept-Ranges: bytes
< Content-Length: 690491072
< Content-Type: application/x-xz
< 
{ [2628 bytes data]

 90  658M   90  594M    0     0  1038M      0 --:--:-- --:--:-- --:--:-- 1036M
100  658M  100  658M    0     0  1033M      0 --:--:-- --:--:-- --:--:-- 1032M
* Connection #0 to host 10.xxx.ccc.3 left intact
+ break
+ [[ .xz == .gz ]]
+ [[ .xz == .xz ]]
+ unxz fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2.xz
+ '[' -n ip=dhcp ']'
++ LIBGUESTFS_BACKEND=direct
++ virt-filesystems -a fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2 -l
++ grep boot
++ cut -f1 '-d '
+ BOOT_DISK=/dev/sda3
+ LIBGUESTFS_BACKEND=direct
+ virt-edit -a fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2 -m /dev/sda3 /boot/loader/entries/ostree-1-rhcos.conf -e 's/^options/options ip=dhcp/'
libguestfs: error: download: /boot/loader/entries/ostree-1-rhcos.conf: No such file or directory
+ rm -fr /shared/tmp/tmp.kZQHYAz4Js
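
The virt-edit failure at the end suggests the downloader script assumes an RHCOS-style BLS entry name (ostree-1-rhcos.conf), while a Fedora CoreOS image typically ships a differently named entry (for example ostree-1-fedora-coreos.conf). One way to confirm which entries the image actually contains, reusing the boot partition detected above (a sketch, assuming a local copy of the unpacked qcow2 is still available, since the script deletes its temp directory on exit):

LIBGUESTFS_BACKEND=direct virt-ls -a fedora-coreos-35.20220327.3.0-openstack.x86_64.qcow2 -m /dev/sda3 /boot/loader/entries/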
openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale