ryanhay / ocp4-metal-install

Install OpenShift 4 on Bare Metal - UPI

Issue during boot on bootstrap and control plane #2

Closed AvRajath closed 3 years ago

AvRajath commented 3 years ago

Version

$ openshift-install version
./openshift-install 4.5.9
built from commit 0d5c871ce7d03f3d03ab4371dc39916a5415cf5c
release image quay.io/openshift-release-dev/ocp-release@sha256:7ad540594e2a667300dd2584fe2ede2c1a0b814ee6a62f60809d87ab564f4425

Platform: baremetal

UPI (semi-manual installation on customised infrastructure)

What happened?

Cluster details: the control plane and worker nodes are on bare metal, but the bootstrap node is running on an ESXi server on the same network.

After I launch my bootstrap and control plane nodes, I see this output while waiting for the bootstrap to complete:

~/openshift-install --dir ~/ocp-install wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.5.9
DEBUG Built from commit 0d5c871
INFO Waiting up to 20m0s for the Kubernetes API at https://api.lab.ocp.lan:6443...
DEBUG Still waiting for the Kubernetes API: Get https://api.lab.ocp.lan:6443/version?timeout=32s: EOF
(the "Still waiting" line above repeats unchanged for the rest of the output)

On the control plane nodes I can see the error shown in the image below: image

May I know if I have missed something? I feel there is some issue with connectivity.

AvRajath commented 3 years ago

image

My HAProxy status

ryanhay commented 3 years ago

Hey, are you able to ssh to the control plane nodes and have a look at what you can see? I would check journalctl -xe, and the status and logs of the containers with crictl ps -a and crictl logs <container id>.
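As a rough sketch of the check suggested here, the helper below reads the output of crictl ps -a and prints only the containers that are not in the Running state, so it is clear which container IDs to pass to crictl logs. The function name not_running is made up for this example, and it assumes the column layout shown later in this thread (CONTAINER, IMAGE, CREATED, STATE, NAME, ...).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<name> <container id>" for every container
# that is not Running. Reads `crictl ps -a` output on stdin.
not_running() {
  awk 'NR > 1 {
    # CREATED spans several words ("4 hours ago"), so locate the STATE
    # column by its value instead of by a fixed field number.
    for (i = 3; i <= NF; i++) {
      if ($i ~ /^(Running|Exited|Created|Unknown)$/) {
        if ($i != "Running") print $(i + 1), $1
        break
      }
    }
  }'
}
```

On a node this could be used as sudo crictl ps -a | not_running, followed by sudo crictl logs with one of the printed container IDs.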

RobyYasirAmri commented 3 years ago

Hi ryanhay, I have a similar problem too. I can ssh to the control plane and can ping all nodes. (attached images: ssh, cp, ha)

ryanhay commented 3 years ago

It's good that you can ping the nodes to prove connectivity, but I don't think that is the problem. It looks like the machine config server isn't coming up and serving on port 22623 for some reason, which results in errors when the nodes try to reach it. Can you see any errors when you run the commands from my previous comment on the bootstrap and control plane nodes?
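Since ping only proves ICMP reachability, a TCP-level probe of the ports involved is a closer test. The sketch below uses bash's /dev/tcp redirection and the coreutils timeout command; the function names port_state and report_ports are invented for this example, and the 3 second timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "open" if a TCP connection to host:port
# succeeds within 3 seconds, otherwise "closed". "closed" covers any
# failure: refused connection, timeout, or a name that does not resolve.
port_state() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Print one "host:port state" line per port given after the host.
report_ports() {
  local host="$1"; shift
  local port
  for port in "$@"; do
    echo "${host}:${port} $(port_state "$host" "$port")"
  done
}
```

For the cluster in this thread that would be something like report_ports api.lab.ocp.lan 6443 22623. Note that when the name points at HAProxy, an open port only shows that the load balancer is listening, so it may be worth running the same probe against the bootstrap node's own address as well.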

RobyYasirAmri commented 3 years ago

Yes, I have attached the log bundle: log-bundle-20201016112311.zip

ryanhay commented 3 years ago

Search in those logs for '192.168.22.81'; etcd isn't coming up. Maybe check that your MAC addresses are entered correctly in dhcpd.conf and that the static IPs are being assigned correctly. The .81 means the bootstrap is getting an address from the DHCP pool.
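One way to make that comparison easier is to list the host reservations from dhcpd.conf in a flat form and check each MAC against what the node actually reports (for example with ip link). This is only a sketch: the function name static_leases is made up, and it assumes the usual layout with one statement per line inside each host block.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<host> <mac> <fixed-address>" for each host
# block in a dhcpd.conf read from stdin.
# Assumes one statement per line, e.g.:
#   host ocp-bootstrap {
#     hardware ethernet 00:50:56:aa:bb:01;
#     fixed-address 192.168.22.200;
#   }
static_leases() {
  awk '
    $1 == "host" { name = $2 }
    $1 == "hardware" && $2 == "ethernet" { mac = tolower($3); sub(/;$/, "", mac) }
    $1 == "fixed-address" { ip = $2; sub(/;$/, "", ip) }
    # A closing brace ends the current host block; blocks without a host
    # name (subnet, pool) are skipped.
    index($0, "}") > 0 && name != "" { print name, mac, ip; name = mac = ip = "" }
  '
}
```

Typical use would be static_leases < /etc/dhcp/dhcpd.conf (the path depends on the distribution). A node whose MAC is missing or mistyped in that list falls back to the dynamic pool, which matches the .81 address seen in the logs.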

AvRajath commented 3 years ago

@ryanhay Not sure why my bootstrap status isn't UP in HAProxy. I am still facing the same issue.

I can log in to the bootstrap machine and see the pods. Here are the logs:

[core@ocp-bootstrap ~]$ journalctl -xe

-- Unit libpod-conmon-66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254.scope has finished starting up.
--
-- The start-up result is done.
Oct 20 10:21:37 ocp-bootstrap.lab.ocp.lan systemd[1]: Started libcontainer container 66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254.
-- Subject: Unit libpod-66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254.scope has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit libpod-66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254.scope has finished starting up.
--
-- The start-up result is done.
Oct 20 10:21:37 ocp-bootstrap.lab.ocp.lan kernel: SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
Oct 20 10:21:38 ocp-bootstrap.lab.ocp.lan podman[104631]: 2020-10-20 10:21:38.06458143 +0000 UTC m=+0.287191049 container init 66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:38 ocp-bootstrap.lab.ocp.lan podman[104631]: 2020-10-20 10:21:38.074435518 +0000 UTC m=+0.297045137 container start 66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:38 ocp-bootstrap.lab.ocp.lan podman[104631]: 2020-10-20 10:21:38.074592272 +0000 UTC m=+0.297201901 container attach 66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:41 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:41.092415    2499 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
Oct 20 10:21:41 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:41.093825    2499 kubelet_node_status.go:486] Recording NodeHasSufficientMemory event message for node ocp-bootstrap.lab.ocp.lan
Oct 20 10:21:41 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:41.093844    2499 kubelet_node_status.go:486] Recording NodeHasNoDiskPressure event message for node ocp-bootstrap.lab.ocp.lan
Oct 20 10:21:41 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:41.093850    2499 kubelet_node_status.go:486] Recording NodeHasSufficientPID event message for node ocp-bootstrap.lab.ocp.lan
Oct 20 10:21:41 ocp-bootstrap.lab.ocp.lan approve-csr.sh[2521]: Unable to connect to the server: x509: certificate has expired or is not yet valid
Oct 20 10:21:43 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: {"level":"warn","ts":"2020-10-20T10:21:43.079Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-22d08b33-4f86-46f0-ab69-b3a2960a8b22/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Oct 20 10:21:43 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Oct 20 10:21:43 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: Error: unhealthy cluster
Oct 20 10:21:43 ocp-bootstrap.lab.ocp.lan podman[104631]: 2020-10-20 10:21:43.098021709 +0000 UTC m=+5.320631388 container died 66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:43 ocp-bootstrap.lab.ocp.lan systemd[1]: libpod-66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254.scope: Consumed 153ms CPU time
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit libpod-66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254.scope completed and consumed the indicated resources.
Oct 20 10:21:43 ocp-bootstrap.lab.ocp.lan podman[104631]: 2020-10-20 10:21:43.124137172 +0000 UTC m=+5.346746791 container remove 66c2122224afad31b3f1ce0c2b9555f8052cc29f18c61af350ef433c2ebb4254 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:43 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: etcdctl failed. Retrying in 5 seconds...
Oct 20 10:21:48 ocp-bootstrap.lab.ocp.lan podman[104714]: 2020-10-20 10:21:48.18550119 +0000 UTC m=+0.047834143 container create e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:48 ocp-bootstrap.lab.ocp.lan systemd[1]: Started libpod-conmon-e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.scope.
-- Subject: Unit libpod-conmon-e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.scope has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit libpod-conmon-e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.scope has finished starting up.
--
-- The start-up result is done.
Oct 20 10:21:48 ocp-bootstrap.lab.ocp.lan systemd[1]: Started libcontainer container e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.
-- Subject: Unit libpod-e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.scope has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit libpod-e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.scope has finished starting up.
--
-- The start-up result is done.
Oct 20 10:21:48 ocp-bootstrap.lab.ocp.lan kernel: SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
Oct 20 10:21:48 ocp-bootstrap.lab.ocp.lan podman[104714]: 2020-10-20 10:21:48.402565267 +0000 UTC m=+0.264898249 container init e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:48 ocp-bootstrap.lab.ocp.lan podman[104714]: 2020-10-20 10:21:48.413067151 +0000 UTC m=+0.275400113 container start e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:48 ocp-bootstrap.lab.ocp.lan podman[104714]: 2020-10-20 10:21:48.413307352 +0000 UTC m=+0.275640384 container attach e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:51 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:51.104410    2499 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
Oct 20 10:21:51 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:51.105272    2499 kubelet_node_status.go:486] Recording NodeHasSufficientMemory event message for node ocp-bootstrap.lab.ocp.lan
Oct 20 10:21:51 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:51.105295    2499 kubelet_node_status.go:486] Recording NodeHasNoDiskPressure event message for node ocp-bootstrap.lab.ocp.lan
Oct 20 10:21:51 ocp-bootstrap.lab.ocp.lan hyperkube[2499]: I1020 10:21:51.105303    2499 kubelet_node_status.go:486] Recording NodeHasSufficientPID event message for node ocp-bootstrap.lab.ocp.lan
Oct 20 10:21:53 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: {"level":"warn","ts":"2020-10-20T10:21:53.419Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c3ce7f46-ac4b-46b5-a3c2-062816bc5ca1/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Oct 20 10:21:53 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Oct 20 10:21:53 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: Error: unhealthy cluster
Oct 20 10:21:53 ocp-bootstrap.lab.ocp.lan systemd[1]: libpod-e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.scope: Consumed 153ms CPU time
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit libpod-e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0.scope completed and consumed the indicated resources.
Oct 20 10:21:53 ocp-bootstrap.lab.ocp.lan podman[104714]: 2020-10-20 10:21:53.460844894 +0000 UTC m=+5.323177896 container died e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:53 ocp-bootstrap.lab.ocp.lan podman[104714]: 2020-10-20 10:21:53.489428871 +0000 UTC m=+5.351761853 container remove e2f3298d6a07e12677450bf42c79b2f19d470b325574104477dfd217a18cd4a0 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c98ade1e096fc9b26cd152dd9b1e9544aba7ff8659d72b971509c3f73eca909, name=etcdctl)
Oct 20 10:21:53 ocp-bootstrap.lab.ocp.lan bootkube.sh[2520]: etcdctl failed. Retrying in 5 seconds...
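Most of the journal output above is routine start-up noise. A small filter like the sketch below can reduce it to the distinct failure messages. The function name notable_errors and the pattern list are assumptions chosen to match the errors visible in this excerpt, not a complete list.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read journal text on stdin, keep lines that match a
# few failure patterns, strip the "Mon DD HH:MM:SS host " prefix, and print
# each distinct message once.
notable_errors() {
  grep -E 'x509: |connection refused|unhealthy|etcdctl failed' \
    | sed -E 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]+ [^ ]+ //' \
    | sort -u
}
```

It could be run as journalctl --no-pager | notable_errors on the bootstrap node. In this excerpt it would surface the x509 "certificate has expired or is not yet valid" message from approve-csr.sh alongside the etcd connection failures. That kind of x509 message usually points at a wrong system clock or at certificates that are no longer valid, so comparing date -u on the node with the real time is a cheap next check.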
AvRajath commented 3 years ago

More info

[root@ocp-bootstrap core]# crictl ps -a
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                        ATTEMPT             POD ID
fd216955da615       2f0188c53bdac99e42c224973b16989a0194b1cc90e419f9edcf42bc5db1a9ed                                                         4 hours ago         Running             machine-config-server       0                   96043539de805
3c658e97538c4       2f0188c53bdac99e42c224973b16989a0194b1cc90e419f9edcf42bc5db1a9ed                                                         4 hours ago         Exited              machine-config-controller   0                   96043539de805
97b2f8d848e92       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f75aed7c3f79a365430271ae43cc88dcdc9f141400f9fe8c87341116a292137   4 hours ago         Running             certs                       0                   6306798c15d96