okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

metal3 pod crash on baremetal 4.16.0-0.okd-scos-2024-08-21-155613 #2030

Open snehring opened 1 day ago

snehring commented 1 day ago

Describe the bug

The metal3 pod is in CrashLoopBackOff due to a failure in metal3-ironic-inspector. This seems very similar to the bug described in OCPBUGS-32304.

Version

4.16.0-0.okd-scos-2024-08-21-155613, baremetal IPI

How reproducible

It has happened on two clusters I've set up since 4.16.0-0.okd-scos-2024-08-21-155613 became available.

Log bundle

+ CONFIG=/etc/ironic-inspector/ironic-inspector.conf
+ export IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ export INSPECTOR_REVERSE_PROXY_SETUP=true
+ INSPECTOR_REVERSE_PROXY_SETUP=true
+ . /bin/tls-common.sh
++ export IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ export IRONIC_KEY_FILE=/certs/ironic/tls.key
++ IRONIC_KEY_FILE=/certs/ironic/tls.key
++ export IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ export IRONIC_INSECURE=true
++ IRONIC_INSECURE=true
++ export 'IRONIC_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IRONIC_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export 'IPXE_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IPXE_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ export IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ export IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_INSECURE=true
++ IRONIC_INSPECTOR_INSECURE=true
++ export IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ export IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ export IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ export IPXE_KEY_FILE=/certs/ipxe/tls.key
++ IPXE_KEY_FILE=/certs/ipxe/tls.key
++ export RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ export MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ export IPXE_TLS_PORT=8084
++ IPXE_TLS_PORT=8084
++ mkdir -p /certs/ironic
++ mkdir -p /certs/ironic-inspector
++ mkdir -p /certs/ca/ironic
mkdir: cannot create directory '/certs/ca/ironic': Permission denied

snehring commented 1 day ago

Actually, the issue seems to be a little different. The container runs as uid 1002 and gid 1004, which aren't the uid and gid of either ironic or ironic-inspector in the container image:

sh-5.1$ id ironic
uid=997(ironic) gid=995(ironic) groups=995(ironic)
sh-5.1$ id ironic-inspector
uid=996(ironic-inspector) gid=994(ironic-inspector) groups=994(ironic-inspector)
sh-5.1$ ls -lan /certs/ca
total 0
drwxrwsr-x. 2 997 994  6 Jun 11 10:39 .
drwxrwsr-x. 1 997 994 44 Sep 17 19:50 ..
sh-5.1$ id
uid=1002(1002) gid=1004 groups=1004

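So the permission errors make sense: /certs/ca is owned by 997:994 and is writable only by that owner and group, and the process running as 1002:1004 matches neither.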
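
To make that reasoning concrete, here is a minimal sketch of the classic POSIX owner/group/other write check, fed with the numbers from the listing above. The function name `can_write` and the constants are made up for illustration, and the sketch deliberately ignores root, ACLs, capabilities and SELinux (the trailing `.` in the `ls` output suggests an SELinux context is present), so treat it as an approximation rather than what the kernel does in full.

```python
import stat
from typing import Iterable


def can_write(mode: int, owner_uid: int, owner_gid: int,
              proc_uid: int, proc_gids: Iterable[int]) -> bool:
    """Classic POSIX check: exactly one permission class applies to a process."""
    if proc_uid == owner_uid:
        # The process owns the directory: only the owner bits count.
        return bool(mode & stat.S_IWUSR)
    if owner_gid in set(proc_gids):
        # Not the owner, but in the owning group: only the group bits count.
        return bool(mode & stat.S_IWGRP)
    # Everyone else falls through to the "other" bits.
    return bool(mode & stat.S_IWOTH)


# drwxrwsr-x is rwxrwxr-x (0o775) plus the setgid bit (0o2000).
CERTS_CA_MODE = 0o2775
CERTS_CA_UID, CERTS_CA_GID = 997, 994  # numeric owner and group from `ls -lan /certs/ca`

# Identity the container actually runs with, from `id`.
print(can_write(CERTS_CA_MODE, CERTS_CA_UID, CERTS_CA_GID, 1002, [1004]))  # False
# Identities the image defines for ironic and ironic-inspector.
print(can_write(CERTS_CA_MODE, CERTS_CA_UID, CERTS_CA_GID, 997, [995]))    # True
print(can_write(CERTS_CA_MODE, CERTS_CA_UID, CERTS_CA_GID, 996, [994]))    # True
```

Under these assumptions, only the users baked into the image (ironic as owner, ironic-inspector via group 994) would be able to create `/certs/ca/ironic`, which is consistent with the `Permission denied` from `mkdir` in the log.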

snehring commented 1 day ago

I think the issue lies with prepare-image.sh. Per the image manifest, the package list file created for OKD is called main-packages-list.okd instead of main-packages-list.ocp.

snehring commented 1 day ago

I think I've figured out the root of the problem and have put in a fix: https://github.com/openshift/ironic-image/pull/581