okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

4.11.0-0.okd-2022-11-05-030711 install attempt on azure fails #1405

Closed by gurolakman 1 month ago

gurolakman commented 1 year ago

`openshift-install create cluster` fails (timeout after 20m) after printing the following message:

```
DEBUG Still waiting for the Kubernetes API: Get "https://api.okd.telenity.com:6443/version": dial tcp: lookup api.okd.telenity.com on 127.0.0.53:53: no such host
```
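The lookup here is failing against the local systemd-resolved stub (127.0.0.53), so a quick sanity check is whether the API record resolves from the install host at all; a sketch, using the hostname from the message above:

```
# Check the API record from the install host (hostname taken from the error above)
dig +short api.okd.telenity.com
resolvectl query api.okd.telenity.com
```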

I noticed strange complaints on the bootstrap node, as follows:

```
Nov 15 00:43:28 okd-nv7ln-bootstrap etcdctl[2124]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Nov 15 00:43:28 okd-nv7ln-bootstrap bootkube.sh[2060]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Nov 15 00:43:28 okd-nv7ln-bootstrap bootkube.sh[2060]: Error: unhealthy cluster
Nov 15 00:43:28 okd-nv7ln-bootstrap etcdctl[2124]: Error: unhealthy cluster
```
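For reference, the failing health probe can be reproduced by hand on the bootstrap node; a rough sketch, assuming etcdctl and the etcd TLS material are available there (the certificate paths below are placeholders, not the actual OKD paths):

```
# Rough manual equivalent of the failing etcd health check (cert paths are placeholders)
sudo etcdctl \
  --endpoints=https://localhost:2379 \
  --cacert=/path/to/etcd-ca.crt \
  --cert=/path/to/etcd-client.crt \
  --key=/path/to/etcd-client.key \
  endpoint health
```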

I don't believe the timeout is caused by DNS misconfiguration, since the logs contain the following entries:

```
Nov 15 00:43:40 okd-nv7ln-bootstrap cluster-bootstrap[2541]: Waiting up to 20m0s for the Kubernetes API
Nov 15 00:43:40 okd-nv7ln-bootstrap bootkube.sh[2483]: Waiting up to 20m0s for the Kubernetes API
Nov 15 00:43:41 okd-nv7ln-bootstrap cluster-bootstrap[2541]: Still waiting for the Kubernetes API: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused
Nov 15 00:43:41 okd-nv7ln-bootstrap bootkube.sh[2483]: Still waiting for the Kubernetes API: Get "https://localhost:6443/readyz": dial tcp [::1]:6443: connect: connection refused
Nov 15 00:43:57 okd-nv7ln-bootstrap cluster-bootstrap[2541]: API is up
Nov 15 00:43:57 okd-nv7ln-bootstrap bootkube.sh[2483]: API is up
```
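The last two entries show the API did eventually come up on the bootstrap node. For reference, the readiness endpoint the installer polls can also be checked by hand; a sketch, run on the bootstrap node (`-k` skips TLS verification):

```
# Same readiness probe the installer polls; expect "ok" once the API is up
curl -k https://localhost:6443/readyz
```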

I'd appreciate any feedback on the possible root cause of this misbehavior.

- OKD release: 4.11.0-0.okd-2022-11-05-030711
- OS image: fedora-coreos-36.20220716.3.1-azure.x86_64.vhd
- Infrastructure: installer-provisioned (Azure)

The problem is 100% reproducible using:

```
$ openshift-install create cluster --log-level=debug --dir .
```

Please find the log bundle attached. Many thanks in advance!

log-bundle-20221115010854.tar.gz
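For anyone who wants to collect a similar bundle, the installer's gather subcommand can produce one; a sketch, run from the same install directory:

```
# Collect bootstrap-phase logs into log-bundle-<timestamp>.tar.gz
openshift-install gather bootstrap --dir . --log-level=debug
```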

marcin-zajac commented 1 year ago

I have the same error when installing on bare metal.

From your logs, a DeadlineExceeded error:

{"level":"warn","ts":"2022-11-15T01:06:28.467Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00075bdc0/#initially=[https://172.24.36.6:2379;https://172.24.36.7:2379;https://172.24.36.8:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.24.36.6:2379: connect: connection refused\""}

It looks like a problem with the Red Hat pull secret 🤬. On my installation I have a new one (the original secret should expire after 24h), but I get the same error as you.

When I try to use the fake secret `{"auths":{"fake":{"auth":"aWQ6cGFzcwo="}}}` (following the OKD docs), I get this:

```
Nov 17 18:39:19 okd4-bootstrap bootkube.sh[64185]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e9791918453054e5290fd40ef78384fdb1ad449f0c41db974e90ed45f4dd40f2: reading manifest sha256:e9791918453054e5290fd40ef78384fdb1ad449f0c41db974e90ed45f4dd40f2 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
Nov 17 18:39:19 okd4-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Nov 17 18:39:19 okd4-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
```
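A pull secret can also be tested outside the installer by attempting the same manifest fetch directly; a sketch assuming podman is available (the authfile path is a placeholder, and the digest is copied from the error above):

```
# Try pulling the exact manifest bootkube failed on, using a given pull secret
podman pull --authfile ~/pull-secret.json \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e9791918453054e5290fd40ef78384fdb1ad449f0c41db974e90ed45f4dd40f2
```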

I think it's the same problem.

gurolakman commented 1 year ago

@marcin-zajac thanks for your note! I'll get a new pull secret and retry over the weekend to see if things work better. I'll update the thread accordingly. Regards.

gurolakman commented 1 year ago

@marcin-zajac I downloaded a new pull secret and tried the OKD cluster installation one more time. Not much seems to have changed. I'm seeing the following failures on my master nodes:

```
[core@okd-9zbvd-master-0 ~]$ systemctl status openshift-azure-routes.service
× openshift-azure-routes.service - Work around Azure load balancer hairpin
     Loaded: loaded (/etc/systemd/system/openshift-azure-routes.service; static)
     Active: failed (Result: start-limit-hit) since Sat 2022-11-19 17:59:31 UTC; 1h 33min ago
TriggeredBy: × openshift-azure-routes.path
   Main PID: 2230 (code=exited, status=0/SUCCESS)
        CPU: 28ms

Nov 19 17:59:31 okd-9zbvd-master-0 systemd[1]: Started openshift-azure-routes.service - Work around Azure load balancer hairpin.
Nov 19 17:59:31 okd-9zbvd-master-0 openshift-azure-routes[2230]: done applying vip rules
Nov 19 17:59:31 okd-9zbvd-master-0 systemd[1]: openshift-azure-routes.service: Deactivated successfully.
Nov 19 17:59:31 okd-9zbvd-master-0 systemd[1]: openshift-azure-routes.service: Start request repeated too quickly.
Nov 19 17:59:31 okd-9zbvd-master-0 systemd[1]: openshift-azure-routes.service: Failed with result 'start-limit-hit'.
Nov 19 17:59:31 okd-9zbvd-master-0 systemd[1]: Failed to start openshift-azure-routes.service - Work around Azure load balancer hairpin.

[core@okd-9zbvd-master-0 ~]$ systemctl status openshift-azure-routes.path
× openshift-azure-routes.path - Watch for downfile changes
     Loaded: loaded (/etc/systemd/system/openshift-azure-routes.path; enabled; vendor preset: disabled)
     Active: failed (Result: unit-start-limit-hit) since Sat 2022-11-19 17:59:31 UTC; 1h 34min ago
   Triggers: ● openshift-azure-routes.service

Nov 19 17:59:12 okd-9zbvd-master-0 systemd[1]: Started openshift-azure-routes.path - Watch for downfile changes.
Nov 19 17:59:31 okd-9zbvd-master-0 systemd[1]: openshift-azure-routes.path: Failed with result 'unit-start-limit-hit'.
```
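For what it's worth, `start-limit-hit` only means systemd gave up retrying the unit; one way to rule it out is to clear the start-rate counter and re-trigger the units (a sketch, run on the affected master, using the unit names from the output above):

```
# Clear systemd's start-rate counter for both units, then re-start the path unit
sudo systemctl reset-failed openshift-azure-routes.path openshift-azure-routes.service
sudo systemctl start openshift-azure-routes.path
```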

Interestingly enough, all 3 master and 3 worker nodes appear to be up and running per the oc output:

```
NAME                                 STATUS   ROLES    AGE    VERSION
okd-9zbvd-master-0                   Ready    master   103m   v1.24.6+5658434
okd-9zbvd-master-1                   Ready    master   102m   v1.24.6+5658434
okd-9zbvd-master-2                   Ready    master   103m   v1.24.6+5658434
okd-9zbvd-worker-westeurope1-lxk2g   Ready    worker   89m    v1.24.6+5658434
okd-9zbvd-worker-westeurope2-9ll62   Ready    worker   89m    v1.24.6+5658434
okd-9zbvd-worker-westeurope3-ndccc   Ready    worker   90m    v1.24.6+5658434
```

Per the OKD documentation, I opened TCP ports 6443 and 22623 on the bootstrap and master nodes, and TCP ports 80 and 443 on the worker nodes, in the related network security groups prior to the installation.
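For reference, a sketch of opening one of these ports with the Azure CLI; the resource group, NSG name, rule name, and priority below are hypothetical placeholders:

```
# Hypothetical example: allow the Kubernetes API port through a master NSG
az network nsg rule create \
  --resource-group my-okd-rg \
  --nsg-name okd-master-nsg \
  --name allow-kube-api \
  --priority 200 \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 6443
```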

I'm attaching the log bundle from my last installation attempt. Thanks!

log-bundle-20221119181803.tar.gz

JaimeMagiera commented 1 month ago

Hi,

We are not working on FCOS builds of OKD anymore. Please see these documents:

https://okd.io/blog/2024/06/01/okd-future-statement
https://okd.io/blog/2024/07/30/okd-pre-release-testing

Please test with the OKD SCOS nightlies and file a new issue as needed.

Many thanks,

Jaime