okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Bare Metal UPI (libvirt): Etcd bootstrap looping on errors #90

Closed cgruver closed 4 years ago

cgruver commented 4 years ago

Hey everyone,

I am attempting an OKD 4.4 UPI install using pre-allocated libvirt guests, iPXE, and virtual BMC.

I followed guides in the following documents:

https://github.com/openshift/okd/blob/master/Documentation/UPI/libvirt/libvirt.md
https://github.com/openshift/installer/blob/master/docs/user/metal/install_upi.md

The guests are initially booting via iPXE, getting reserved IPs with the appropriate DNS A and PTR records, and loading the appropriate ignition configs based on their MAC address.

I have an haproxy server set up for load balancing as described in the UPI docs.
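For reference, the UPI load balancer is plain TCP passthrough. A minimal haproxy.cfg sketch along these lines (the backend names are illustrative, and the IPs follow this lab's layout: bootstrap on .99, masters on .101-.103):

```text
# Illustrative haproxy.cfg fragment for a UPI cluster load balancer.
# Hostnames and IPs are this lab's values, shown only as an example.
defaults
    mode tcp
    timeout connect 5s
    timeout client  30s
    timeout server  30s

listen api-server
    bind *:6443
    balance roundrobin
    server okd4-bootstrap 10.11.11.99:6443  check
    server okd4-master1   10.11.11.101:6443 check
    server okd4-master2   10.11.11.102:6443 check
    server okd4-master3   10.11.11.103:6443 check

listen machine-config-server
    bind *:22623
    balance roundrobin
    server okd4-bootstrap 10.11.11.99:22623  check
    server okd4-master1   10.11.11.101:22623 check
    server okd4-master2   10.11.11.102:22623 check
    server okd4-master3   10.11.11.103:22623 check
```

Ingress backends (ports 80/443 pointing at the workers) would be added in the same style once workers exist.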

Guests have 20GB of RAM, 4 vCPU, and 100GB /dev/sda

OKD Release: 4.4.0-0.okd-2020-02-28-211838
FCOS Release: fedora-coreos-31.20200210.3.0

Install Config:

apiVersion: v1
baseDomain: my.cluster.domain
metadata:
  name: okd4
networking:
  networkType: OpenShiftSDN
  clusterNetwork:
  - cidr: 10.100.0.0/14 
    hostPrefix: 23 
  serviceNetwork: 
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.11.11.0/24
compute:
- name: worker
  replicas: 3
controlPlane:
  name: master
  replicas: 3
platform:
  none: {}
pullSecret: '{"auths": {"quay.io": {"auth": "Y2dydXZlcjpVci9REDACTED", "email": ""}}}'
sshKey: ssh-rsa AAAAB3NzaC1ycREDACTED root@my-bastion-host

FCOS is booting just fine. The ignition configs load on bootstrap, master, and worker nodes, and the bootstrap node begins to install the cluster.

Eventually the bootstrap bootkube.service logs begin looping on the following:

Mar 02 00:14:19 okd4-bootstrap bootkube.sh[747]: etcdctl failed. Retrying in 5 seconds...
Mar 02 00:14:24 okd4-bootstrap podman[30908]: 2020-03-02 00:14:24.438036201 +0000 UTC m=+0.108530107 container create c7c8f8689857dc5b071c3b06ee47e042f5feb40dec1995b61f1355b8ce39f624 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:24 okd4-bootstrap podman[30908]: 2020-03-02 00:14:24.659523512 +0000 UTC m=+0.330017430 container init c7c8f8689857dc5b071c3b06ee47e042f5feb40dec1995b61f1355b8ce39f624 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:24 okd4-bootstrap podman[30908]: 2020-03-02 00:14:24.697090928 +0000 UTC m=+0.367584827 container start c7c8f8689857dc5b071c3b06ee47e042f5feb40dec1995b61f1355b8ce39f624 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:24 okd4-bootstrap podman[30908]: 2020-03-02 00:14:24.697251096 +0000 UTC m=+0.367745021 container attach c7c8f8689857dc5b071c3b06ee47e042f5feb40dec1995b61f1355b8ce39f624 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:29 okd4-bootstrap bootkube.sh[747]: {"level":"warn","ts":"2020-03-02T00:14:29.680Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-37e70e11-22de-416a-98e0-c9d8c03ec97b/10.11.11.99:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.11.11.99:2379: connect: connection refused\""}
Mar 02 00:14:29 okd4-bootstrap bootkube.sh[747]: https://10.11.11.99:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Mar 02 00:14:29 okd4-bootstrap bootkube.sh[747]: Error: unhealthy cluster
Mar 02 00:14:29 okd4-bootstrap podman[30908]: 2020-03-02 00:14:29.729785554 +0000 UTC m=+5.400279462 container died c7c8f8689857dc5b071c3b06ee47e042f5feb40dec1995b61f1355b8ce39f624 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:29 okd4-bootstrap podman[30908]: 2020-03-02 00:14:29.833518442 +0000 UTC m=+5.504012363 container remove c7c8f8689857dc5b071c3b06ee47e042f5feb40dec1995b61f1355b8ce39f624 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:29 okd4-bootstrap bootkube.sh[747]: etcdctl failed. Retrying in 5 seconds...
Mar 02 00:14:34 okd4-bootstrap podman[31039]: 2020-03-02 00:14:34.968059049 +0000 UTC m=+0.112849058 container create 2f16910299bb9515551046bd93b2c6ff00e4df83f5b14e01966f8531c1a18f79 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:35 okd4-bootstrap podman[31039]: 2020-03-02 00:14:35.186749277 +0000 UTC m=+0.331539334 container init 2f16910299bb9515551046bd93b2c6ff00e4df83f5b14e01966f8531c1a18f79 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:35 okd4-bootstrap podman[31039]: 2020-03-02 00:14:35.220451999 +0000 UTC m=+0.365242009 container start 2f16910299bb9515551046bd93b2c6ff00e4df83f5b14e01966f8531c1a18f79 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:35 okd4-bootstrap podman[31039]: 2020-03-02 00:14:35.2207554 +0000 UTC m=+0.365545412 container attach 2f16910299bb9515551046bd93b2c6ff00e4df83f5b14e01966f8531c1a18f79 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:40 okd4-bootstrap bootkube.sh[747]: {"level":"warn","ts":"2020-03-02T00:14:40.204Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-216fc7e2-36cc-4f6b-b8fa-a37db1473139/10.11.11.99:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.11.11.99:2379: connect: connection refused\""}
Mar 02 00:14:40 okd4-bootstrap bootkube.sh[747]: https://10.11.11.99:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Mar 02 00:14:40 okd4-bootstrap bootkube.sh[747]: Error: unhealthy cluster
Mar 02 00:14:40 okd4-bootstrap podman[31039]: 2020-03-02 00:14:40.221844038 +0000 UTC m=+5.366634070 container died 2f16910299bb9515551046bd93b2c6ff00e4df83f5b14e01966f8531c1a18f79 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:40 okd4-bootstrap podman[31039]: 2020-03-02 00:14:40.31098892 +0000 UTC m=+5.455778922 container remove 2f16910299bb9515551046bd93b2c6ff00e4df83f5b14e01966f8531c1a18f79 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:40 okd4-bootstrap bootkube.sh[747]: etcdctl failed. Retrying in 5 seconds...
Mar 02 00:14:45 okd4-bootstrap podman[31155]: 2020-03-02 00:14:45.441823721 +0000 UTC m=+0.112959670 container create 6c4ae4af39e665f278bae5464dc6a7816cc0d441443ce94443e603d341336d21 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:45 okd4-bootstrap podman[31155]: 2020-03-02 00:14:45.667051013 +0000 UTC m=+0.338186971 container init 6c4ae4af39e665f278bae5464dc6a7816cc0d441443ce94443e603d341336d21 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:45 okd4-bootstrap podman[31155]: 2020-03-02 00:14:45.682995692 +0000 UTC m=+0.354131641 container start 6c4ae4af39e665f278bae5464dc6a7816cc0d441443ce94443e603d341336d21 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:45 okd4-bootstrap podman[31155]: 2020-03-02 00:14:45.683090711 +0000 UTC m=+0.354226669 container attach 6c4ae4af39e665f278bae5464dc6a7816cc0d441443ce94443e603d341336d21 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:50 okd4-bootstrap bootkube.sh[747]: {"level":"warn","ts":"2020-03-02T00:14:50.688Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-5ec77a5e-605e-43a6-bef5-a2e4713662f5/10.11.11.99:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.11.11.99:2379: connect: connection refused\""}
Mar 02 00:14:50 okd4-bootstrap bootkube.sh[747]: https://10.11.11.99:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Mar 02 00:14:50 okd4-bootstrap bootkube.sh[747]: Error: unhealthy cluster
Mar 02 00:14:50 okd4-bootstrap podman[31155]: 2020-03-02 00:14:50.708852171 +0000 UTC m=+5.379988142 container died 6c4ae4af39e665f278bae5464dc6a7816cc0d441443ce94443e603d341336d21 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:50 okd4-bootstrap podman[31155]: 2020-03-02 00:14:50.843319039 +0000 UTC m=+5.514454993 container remove 6c4ae4af39e665f278bae5464dc6a7816cc0d441443ce94443e603d341336d21 (image=registry.svc.ci.openshift.org/origin/4.4-2020-02-28-211838@sha256:dd646034fc2e4ec6bb606d1eac199b574aee10478c84d31caa81a791cbf99ead, name=etcdctl)
Mar 02 00:14:50 okd4-bootstrap bootkube.sh[747]: etcdctl failed. Retrying in 5 seconds...

Any suggestions? Searching for these errors turned up nothing, so I suspect I'm either doing something wrong or hitting an issue that isn't well known yet.
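For anyone hitting the same loop: the etcdctl retries above are just "connection refused" on the etcd client port, meaning no etcd member is listening yet. A quick reachability sweep of the endpoints shows which members are up, without waiting on the retry loop (a bash sketch using /dev/tcp; hosts are this lab's addresses, substitute your own):

```shell
# Probe the etcd client port (2379) on each control-plane endpoint.
# "not reachable" here corresponds to the bootkube "connection refused" loop.
for host in 10.11.11.99 10.11.11.101 10.11.11.102 10.11.11.103; do
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/2379" 2>/dev/null; then
        echo "${host}:2379 reachable"
    else
        echo "${host}:2379 not reachable"
    fi
done
```

Until the masters have pulled their machine configs and started etcd static pods, every endpoint except (eventually) the bootstrap will report not reachable.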

vrutkovs commented 4 years ago

Please attach log bundle for more info - https://docs.openshift.com/container-platform/4.3/installing/installing-gather-logs.html

cgruver commented 4 years ago

@vrutkovs Thanks Vadim, I will run it again when I get home this evening and attempt to get the log bundle.

cgruver commented 4 years ago

I seem to have made it worse... Now the bootstrap isn't even getting started. It is looping on:

Mar 04 12:02:55 okd4-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Mar 04 12:02:55 okd4-bootstrap bootkube.sh[7285]: Error: error getting image "registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f": unable to find 'registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f' in local storage: no such image
Mar 04 12:02:55 okd4-bootstrap bootkube.sh[7285]: Warning: Could not resolve release image to pull by digest
Mar 04 12:02:56 okd4-bootstrap bootkube.sh[7285]: Error: unable to pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f: Error reading manifest sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f in registry.svc.ci.openshift.org/origin/release: manifest unknown: manifest unknown
Mar 04 12:02:56 okd4-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Mar 04 12:02:56 okd4-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Mar 04 12:03:01 okd4-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 72.
Mar 04 12:03:01 okd4-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.

I tried pulling the image that it is referring to on my laptop:

docker pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
Error response from daemon: manifest for registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f not found: manifest unknown: manifest unknown

As you can see, I'm seeing a similar error.

Since I was seeing similar errors last night with the attempt this issue was opened for, I pulled a different OKD release. This attempt uses:

oc adm release extract --command='openshift-install' registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-03-170958

Is it possible that there is an issue with pulling registry.svc.ci.openshift.org/origin/release... images, or the registry?

Debug logs attached:

log-bundle-20200304070745.tar.gz

cgruver commented 4 years ago

FWIW, I tried this last night with several different OKD releases. Same behavior for all.

vrutkovs commented 4 years ago

We keep releases for 48hrs only, so it might have been deleted?

Try https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#installation-mirror-repository_installing-restricted-networks-preparations before your next deploy

cgruver commented 4 years ago

@vrutkovs

Vadim, is it possible that something has gone pancake shaped with the release images in the last day or so? Here is what I am seeing:

I extracted the installer from: release:4.4.0-0.okd-2020-03-03-170958 which has a SHA of 507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738.

However, the install is trying to pull: release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f which appears to be the wrong image. As noted above, if I try to pull the same image on my laptop, it also fails.

docker pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
Error response from daemon: manifest for registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f not found: manifest unknown: manifest unknown

But, if I use the SHA that is correct:

docker pull registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738: Pulling from origin/release
34971b2d1eb9: Pull complete 
4fbc3bafa3d4: Pull complete 
b6b944cbc4e6: Pull complete 
4cb248145d19: Pull complete 
bf0437183dc9: Pull complete 
7a34e867ac9e: Pull complete 
Digest: sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
Status: Downloaded newer image for registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738

It works!

vrutkovs commented 4 years ago
$ oc adm release extract --command=openshift-install registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-03-170958
$ ./openshift-install version
./openshift-install 4.4.0-0.okd-2020-03-03-170958
built from commit b8170d82bf1034d197c33b7e3118a03416b1725d
release image registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
$ skopeo inspect docker://registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738 | grep "io.openshift.release"
"io.openshift.release": "4.4.0-0.okd-2020-03-03-170958",

Make sure you've cleaned openshift-install dir and don't have OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE set anywhere

cgruver commented 4 years ago

Thanks.

Yes, I wipe the ignition configs and install directory with every attempt.

OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE is not set.

I am now attempting an install with:

export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738

I'll update with the results.

vrutkovs commented 4 years ago

No, don't do that.

Use a new clean dir - openshift-install --dir

cgruver commented 4 years ago

Too late... ;-)

I did use a clean openshift-install --dir

cgruver commented 4 years ago

I'm going to scrub the whole environment and try again from a fresh start.

Will update. Thanks for all your help.

cgruver commented 4 years ago

Well, I am completely at a loss. I'm sure I've got something messed up here, but after cleaning the whole environment up, I see the same behavior.

It still looks like it is trying to pull the wrong release image.

I have verified that the bootstrap machine can pull images. So it does not look like a connectivity, routing, or DNS issue.

[root@osc-controller01 okd4-lab]# ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@okd4-bootstrap 
Warning: Permanently added 'okd4-bootstrap,10.11.11.99' (ECDSA) to the list of known hosts.
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary service is "bootkube.service". To watch its status, run e.g.

  journalctl -b -f -u bootkube.service
Fedora CoreOS 31.20200210.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/

[core@okd4-bootstrap ~]$ sudo bash
[root@okd4-bootstrap core]# docker pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
Error response from daemon: manifest for registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f not found
[root@okd4-bootstrap core]# docker pull registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738: Pulling from origin/release
34971b2d1eb9: Pull complete 
4fbc3bafa3d4: Pull complete 
b6b944cbc4e6: Pull complete 
4cb248145d19: Pull complete 
bf0437183dc9: Pull complete 
7a34e867ac9e: Pull complete 
Digest: sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
Status: Downloaded newer image for registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
[root@okd4-bootstrap core]#

As you can see, if I manually try to pull the image that bootkube.service is trying to pull, I get the same error.

If I pull release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738, it works.

I'm using openshift-install. Should I be using openshift-baremetal-install?

Anything else I should look at?

Logs attached.

log-bundle-20200305070819.tar.gz

vrutkovs commented 4 years ago

It's not clear which release image is used in the installer, so I don't know which release you're installing - all I can see is that it's gone.

Try mirroring the latest available release to a safe location - https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html

cgruver commented 4 years ago

This is the release that I believe it should be:

release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738

I am using openshift-install extracted from 4.4.0-0.okd-2020-03-03-170958 which matches the hash above.

oc adm release extract --command=openshift-install registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-03-170958
[root@osc-controller01 okd4-lab]# openshift-install version
openshift-install 4.4.0-0.okd-2020-03-03-170958
built from commit b8170d82bf1034d197c33b7e3118a03416b1725d
release image registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738

But the bootstrap node is clearly trying to pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f

It is really strange... I will try your suggestion. I'm also moving to registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-04-125648 from yesterday.

I'll update on progress.

Thanks!

cgruver commented 4 years ago

@vrutkovs

I found something!!!

This is openshift-install 4.4.0-0.okd-2020-03-03-170958

I did some surgery on the bootstrap.ign file created by openshift-install with the following install-config:

apiVersion: v1
baseDomain: my.cluster.domain
metadata:
  name: okd4
networking:
  networkType: OpenShiftSDN
  clusterNetwork:
  - cidr: 10.100.0.0/14 
    hostPrefix: 23 
  serviceNetwork: 
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.11.11.0/24
compute:
- name: worker
  replicas: 3
controlPlane:
  name: master
  replicas: 3
platform:
  none: {}
pullSecret: '{"auths": {"quay.io": {"auth": "Y2dydXZlcjpVci9REDACTED", "email": ""}}}'
sshKey: ssh-rsa AAAAB3NzaC1ycREDACTED root@my-bastion-host

Below is the bootstrap.ign file entry for /usr/local/bin/release-image-download.sh, whose contents are stored base64-encoded in a data URL under contents.source, decoded:

It's got the wrong release image in it. I've extracted multiple versions of openshift-install from recent releases. They all seem to be putting this same release image into the ignition file.

#!/usr/bin/env bash
set -euo pipefail
# Download the release image. This script is executed as a oneshot
# service by systemd, because we cannot make use of Requires and a
# simple service: https://github.com/systemd/systemd/issues/1312.
#
# This script continues trying to download the release image until
# successful because we cannot use Restart=on-failure with a oneshot
# service: https://github.com/systemd/systemd/issues/2582.
#

RELEASE_IMAGE=registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f

echo "Pulling $RELEASE_IMAGE..."
while ! podman pull --quiet "$RELEASE_IMAGE"
do
    echo "Pull failed. Retrying $RELEASE_IMAGE..."
done

It's got this rogue release image hardcoded into it instead of release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738

I am going to try your suggestion of creating a local mirror. If that overrides the source image, then it will probably get me around this issue, but is it possible that recent releases of the OKD image have the wrong release hardcoded into them?
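For anyone who wants to check their own installer output without hand-editing: the file contents live base64-encoded inside a data: URL in the Ignition JSON (under storage.files[].contents.source), so a few lines of Python can pull a single entry out. The bootstrap.ign written below is a tiny stand-in with the same shape, used only so the sketch is runnable; point the decoder at the real bootstrap.ign produced by openshift-install.

```shell
# Extract and decode one file entry from an Ignition config.
# The bootstrap.ign here is an illustrative stand-in; on a real install,
# skip the cat and run the decoder against the installer's bootstrap.ign.
cat > bootstrap.ign <<'EOF'
{"storage":{"files":[{"path":"/usr/local/bin/release-image-download.sh","contents":{"source":"data:text/plain;charset=utf-8;base64,UkVMRUFTRV9JTUFHRT10ZXN0Cg=="}}]}}
EOF

python3 - bootstrap.ign /usr/local/bin/release-image-download.sh <<'PY'
import base64, json, sys
ign = json.load(open(sys.argv[1]))
for f in ign.get("storage", {}).get("files", []):
    if f.get("path") == sys.argv[2]:
        # contents.source is a data: URL; the payload follows "base64,"
        print(base64.b64decode(f["contents"]["source"].split("base64,", 1)[1]).decode(), end="")
PY
```

Run against the real file, the same select-and-decode prints the script above, so the RELEASE_IMAGE= line the bootstrap will actually use is visible before booting anything.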

vrutkovs commented 4 years ago

It's got the wrong release image in it.

Could you attach .openshift_install.log from the install dir? It might have clues why openshift-install changed the release image

cgruver commented 4 years ago

openshift_install.log

cgruver commented 4 years ago

It says right at the top: OpenShift Installer 4.4.0-0.okd-2020-02-28-211838!!!

Yet I get this:

[root@osc-controller01 okd4-install]# openshift-install version
openshift-install 4.4.0-0.okd-2020-03-04-125648
built from commit b8170d82bf1034d197c33b7e3118a03416b1725d
release image registry.svc.ci.openshift.org/origin/release@sha256:07d8cc840611e1b80d37d1f27c90c31758f2b6208930e20e75507b14d89a3266

Twilight Zone dude!

cgruver commented 4 years ago

Wait... false alarm. That's got a lot of previous builds in it.

Let me create a fresh one.

cgruver commented 4 years ago

openshift_install.log

Fresh log. I also feel like a bit of an idiot...

When I've been cleaning up after an install, I've been removing the contents of okd4-install.

However, I completely forgot about .openshift_install_state.json

I bet the stale .openshift_install_state.json file is poisoning my installation attempts!

I'm running a new install. Will update with results.

vrutkovs commented 4 years ago

I bet the stale .openshift_install_state.json file is poisoning my installation attempts!

Yes :) I usually just rm -rf clusters/foo && openshift-install --dir clusters/foo just to be sure
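The trap is easy to reproduce: a glob like okd4-install/* never matches dotfiles, so "cleaning" the directory that way leaves the hidden installer state behind every time (directory and file names follow this thread):

```shell
# Show why "rm <dir>/*" leaves installer state behind: the glob skips dotfiles.
mkdir -p okd4-install
touch okd4-install/bootstrap.ign okd4-install/.openshift_install_state.json

rm -f okd4-install/*    # removes bootstrap.ign only
ls -A okd4-install      # .openshift_install_state.json is still there

# Removing and recreating the directory per attempt avoids stale state
rm -rf okd4-install && mkdir okd4-install
ls -A okd4-install      # empty
```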

cgruver commented 4 years ago

Yes... important safety tip.

I have modified my procedures accordingly. It is putting the correct image into the ignition file now.

#!/usr/bin/env bash
set -euo pipefail
# Download the release image. This script is executed as a oneshot
# service by systemd, because we cannot make use of Requires and a
# simple service: https://github.com/systemd/systemd/issues/1312.
#
# This script continues trying to download the release image until
# successful because we cannot use Restart=on-failure with a oneshot
# service: https://github.com/systemd/systemd/issues/2582.
#

RELEASE_IMAGE=registry.svc.ci.openshift.org/origin/release@sha256:07d8cc840611e1b80d37d1f27c90c31758f2b6208930e20e75507b14d89a3266

echo "Pulling $RELEASE_IMAGE..."
while ! podman pull --quiet "$RELEASE_IMAGE"
do
    echo "Pull failed. Retrying $RELEASE_IMAGE..."
done

Now, back to testing the install. I'll upload the logs if the original issue still occurs.

cgruver commented 4 years ago

Progress!

Now it is looping on:

Mar 05 17:46:20 okd4-bootstrap bootkube.sh[19998]: [#3033] failed to create some manifests:
Mar 05 17:46:20 okd4-bootstrap bootkube.sh[19998]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"

I'm used to seeing it stuck there for a few minutes while pods are starting, but it's been looping on this for nearly an hour now.

Logs attached.

log-bundle-20200305123643.tar.gz

vrutkovs commented 4 years ago

That error means "machine-config operator didn't start yet and I don't know how to handle MachineConfig objects". Does it crash after 30 mins?

cgruver commented 4 years ago

Yes. It crashes and the bootstrap control plane restarts:

Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Error: error while checking pod status: timed out waiting for the condition
Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Tearing down temporary bootstrap control plane...
Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Error: error while checking pod status: timed out waiting for the condition
Mar 05 16:53:47 okd4-bootstrap podman[6547]: 2020-03-05 16:53:47.126982148 +0000 UTC m=+1200.565302594 container died cf7ebaaba61212fb59d50abbb7124e431d3f682c8b4bd0a9d4783ab5a44dc44a (image=registry.svc.ci.openshift.org/origin/4.4-2020-03-04-125648@sha256:e80f6486d558776d8b331d7856f5ba3bbaff476764b395e82d279ce86c6bb11d, name=zen_maxwell)
Mar 05 16:53:47 okd4-bootstrap podman[6547]: 2020-03-05 16:53:47.257636685 +0000 UTC m=+1200.695957098 container remove cf7ebaaba61212fb59d50abbb7124e431d3f682c8b4bd0a9d4783ab5a44dc44a (image=registry.svc.ci.openshift.org/origin/4.4-2020-03-04-125648@sha256:e80f6486d558776d8b331d7856f5ba3bbaff476764b395e82d279ce86c6bb11d, name=zen_maxwell)
Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container
Mar 05 16:53:47 okd4-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Mar 05 16:53:47 okd4-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Mar 05 16:53:47 okd4-bootstrap systemd[1]: bootkube.service: Consumed 1min 604ms CPU time.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: bootkube.service: Consumed 1min 604ms CPU time.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.

cgruver commented 4 years ago

Should I be using openshift-baremetal-install?

oc adm release extract --command='openshift-baremetal-install' registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-04-125648

cgruver commented 4 years ago

Same behavior with openshift-baremetal-install.

Tried it just for fun...

vrutkovs commented 4 years ago

So the disappearing release image issue has been resolved, right?

Notes on findings: the masters didn't join the cluster. Charro, can you ssh to the masters and fetch the kubelet logs from them?

kube-apiserver:

log.go:172] http: TLS handshake error from 10.11.11.100:56312: tls: client offered only unsupported versions: [300]

Which node is 10.11.11.100? It appears 10.11.11.99 is bootstrap and 10.11.11.101 is one of the masters?

cgruver commented 4 years ago

Yes. The release image issue was self-inflicted... problem between keyboard and seat.

Masters did not get to a point where SSH was available. They were waiting for port 22623 to become available on the bootstrap, which never did because of the machine config issue.

10.11.11.100 is the router, (HAProxy). 10.11.11.101-103 are the master nodes. 10.11.11.99 is the bootstrap.

I'll try again in the morning with any suggestions that you have.

cgruver commented 4 years ago

I believe that my router was not properly forwarding DNS for my cluster domain. The bootstrap and master nodes were unable to resolve api-int.okd4.oscluster.clgcom.org to get to: https://api-int.okd4.oscluster.clgcom.org:22623/config/master

I have modified the router config to properly forward DNS.

sgreene570 commented 4 years ago

I believe that my router was not properly forwarding DNS for my cluster domain. The bootstrap and master nodes were unable to resolve api-int.okd4.oscluster.clgcom.org to get to: https://api-int.okd4.oscluster.clgcom.org:22623/config/master

I have modified the router config to properly forward DNS.

Ah yep, that will cause issues. Make sure the other DNS requirements are configured properly while you're at it (etcd SRV/CNAME records, etc.)
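For completeness, the records a 4.4-era UPI install expects for this cluster domain look roughly like the BIND zone fragment below. The addresses follow this thread (api/api-int on the HAProxy box at .100, masters at .101-.103); the etcd host and SRV records were still required by the install docs of that era:

```text
; Illustrative fragment of the oscluster.clgcom.org zone for cluster "okd4"
; (IPs are this lab's: router/HAProxy .100, masters .101-.103)
api.okd4          IN A    10.11.11.100
api-int.okd4      IN A    10.11.11.100
*.apps.okd4       IN A    10.11.11.100
etcd-0.okd4       IN A    10.11.11.101
etcd-1.okd4       IN A    10.11.11.102
etcd-2.okd4       IN A    10.11.11.103
_etcd-server-ssl._tcp.okd4  IN SRV  0 10 2380 etcd-0.okd4.oscluster.clgcom.org.
_etcd-server-ssl._tcp.okd4  IN SRV  0 10 2380 etcd-1.okd4.oscluster.clgcom.org.
_etcd-server-ssl._tcp.okd4  IN SRV  0 10 2380 etcd-2.okd4.oscluster.clgcom.org.
```

Reverse (PTR) records for the node IPs matter too, since the nodes derive their hostnames from them.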

cgruver commented 4 years ago

Success! Or, at least progress.

It looks like the bootstrap succeeded:

Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        RunningNotReady
Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-kube-controller-manager/kube-controller-manager        Ready
Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-kube-apiserver/kube-apiserver        Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-kube-apiserver/kube-apiserver        Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-kube-scheduler/openshift-kube-scheduler        Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-kube-controller-manager/kube-controller-manager        Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]:         Pod Status:openshift-cluster-version/cluster-version-operator        Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: All self-hosted control plane components successfully started
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: Sending bootstrap-success event.Waiting for remaining assets to be created.

However, before completion, it did throw some errors. Are these significant?

Mar 07 16:29:30 okd4-bootstrap bootkube.sh[746]: E0307 16:29:30.853121       1 reflector.go:153] k8s.io/client-go@v0.17.1/tools/cache/reflector.go:105: Failed to list *v1.Etcd: Get https://api-int.okd4.oscluster.clgcom.org:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: stream error: stream ID 123; INTERNAL_ERROR
Mar 07 16:29:31 okd4-bootstrap bootkube.sh[746]: E0307 16:29:31.965890       1 reflector.go:153] k8s.io/client-go@v0.17.1/tools/cache/reflector.go:105: Failed to list *v1.Etcd: Get https://api-int.okd4.oscluster.clgcom.org:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: stream error: stream ID 125; INTERNAL_ERROR
Mar 07 16:29:33 okd4-bootstrap bootkube.sh[746]: I0307 16:29:33.007587       1 waitforceo.go:64] Cluster etcd operator bootstrapped successfully
Mar 07 16:29:33 okd4-bootstrap bootkube.sh[746]: I0307 16:29:33.014396       1 waitforceo.go:58] cluster-etcd-operator bootstrap etcd
Mar 07 16:29:33 okd4-bootstrap podman[9657]: 2020-03-07 16:29:33.062319447 +0000 UTC m=+69.225380802 container died cd0752e017463b63b1321143632b110b98f65252a79ce8611c1fca3227f33dc0 (image=registry.svc.ci.openshift.org/origin/4.4-2020-03-07-110054@sha256:2b48596c20d8a4d8afa2544d52088c6b7b668281412cbc59639dcd8837f3b4ad, name=hungry_agnesi)
Mar 07 16:29:33 okd4-bootstrap bootkube.sh[746]: bootkube.service complete
Mar 07 16:29:33 okd4-bootstrap systemd[1]: bootkube.service: Succeeded.
Mar 07 16:29:33 okd4-bootstrap systemd[1]: bootkube.service: Consumed 42.596s CPU time.

I'm tearing down the bootstrap now to see how the cluster stands alone.

Yesterday's issue getting past MachineConfig was network bandwidth. The guest wireless network my portable lab was on simply did not have enough bandwidth for the bootstrap to complete before the 30-minute timeout.

vrutkovs commented 4 years ago

However, before completion, it did throw some errors. Are these significant?

That's expected - this happens during the switch to the persistent control plane.

Did the install complete? Any docs required / small issues to fix in the process?

cgruver commented 4 years ago

This install is progressing.

I approved all the CSRs and now the three workers are joining the cluster.

INFO Waiting up to 30m0s for the cluster at https://api.okd4.oscluster.clgcom.org:6443 to initialize... 
ERROR Cluster operator authentication Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.okd4.oscluster.clgcom.org on 172.30.0.10:53: no such host 
INFO Cluster operator authentication Progressing is Unknown with NoData:  
INFO Cluster operator authentication Available is Unknown with NoData:  
INFO Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.4.0-0.okd-2020-03-07-110054 
INFO Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment 
INFO Cluster operator insights Disabled is True with Disabled: Health reporting is disabled 
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 7 
INFO Cluster operator support Disabled is True with : Health reporting is disabled 
FATAL failed to initialize the cluster: Working towards 4.4.0-0.okd-2020-03-07-110054: 100% complete, waiting on authentication, console 

I forgot to approve the CSRs right away, so I ran `openshift-install --dir=okd4-install wait-for install-complete` again to continue monitoring.
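For anyone following along: the pending node CSRs can be approved in bulk instead of one at a time. A sketch, assuming `KUBECONFIG` points at the new cluster; note each worker submits a second (serving) CSR after its first is approved, so run this more than once:

```shell
# List CSRs that have no status (i.e. still pending) and approve them all.
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve
```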

I'm going to document my whole setup and install process. We can cherry-pick from that for the official docs. The vbmc and iPXE config might be useful to others.

Will update shortly.

cgruver commented 4 years ago

Current status:

INFO Waiting up to 30m0s for the cluster at https://api.okd4.oscluster.clgcom.org:6443 to initialize... 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console 
cgruver commented 4 years ago

@vrutkovs

Question for you: I used ephemeral storage for the registry.

```
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed","storage":{"emptyDir":{}}}}'
```

If I patch it again later to use an iSCSI LUN, will that break anything?

vrutkovs commented 4 years ago

If I patch it again later to use an iSCSI LUN, will that break anything?

Replacing the backend is supported; however, I doubt the data would migrate automagically.
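For reference, swapping the backend later is just another merge patch against the same config object. A sketch, assuming a PVC backed by the iSCSI LUN already exists in the `openshift-image-registry` namespace (the claim name `registry-storage` here is hypothetical):

```shell
# Re-point the image registry at persistent storage.
# "registry-storage" is an assumed PVC name -- create the claim first.
oc patch configs.imageregistry.operator.openshift.io cluster \
  --type merge \
  --patch '{"spec":{"storage":{"pvc":{"claim":"registry-storage"}}}}'
```

Anything pushed to the `emptyDir`-backed registry before the switch is lost, which is fine right after install since the registry is still empty.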

cgruver commented 4 years ago

It's ALIVE!!!

DEBUG OpenShift Installer 4.4.0-0.okd-2020-03-07-110054 
DEBUG Built from commit f0d3afed3c4655a6514fdfc54bc40348f0aac80b 
INFO Waiting up to 30m0s for the cluster at https://api.okd4.oscluster.clgcom.org:6443 to initialize... 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console 

DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console 
E0307 12:37:18.657292    5569 reflector.go:280] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: Get https://api.okd4.oscluster.clgcom.org:6443/apis/config.openshift.io/v1/clusterversions?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dversion&resourceVersion=36180&timeoutSeconds=338&watch=true: dial tcp 10.11.11.100:6443: connect: connection refused
DEBUG Cluster is initialized                       
INFO Waiting up to 10m0s for the openshift-console route to be created... 
DEBUG Route found in openshift-console namespace: console 
DEBUG Route found in openshift-console namespace: downloads 
DEBUG OpenShift console route is created           
INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/root/okd4-lab/okd4-install/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.okd4.oscluster.clgcom.org 
INFO Login to the console with user: kubeadmin, password: nNoaw-8YrfZ-5Bddf-FwUoa 

The last thing I missed was that the ingress routers run on the worker nodes. HAProxy was directing ports 443 and 80 to the masters instead of the workers.
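For anyone hitting the same symptom (routes resolving but timing out), the 80/443 frontends need to target the workers. A minimal `haproxy.cfg` fragment as a sketch; the worker hostnames and IPs below are assumptions for this example cluster:

```
frontend ingress-https
    bind *:443
    mode tcp
    default_backend ingress-https

backend ingress-https
    mode tcp
    balance source
    # The ingress routers run on the workers, not the masters
    server okd4-worker-0 10.11.11.60:443 check
    server okd4-worker-1 10.11.11.61:443 check
    server okd4-worker-2 10.11.11.62:443 check
```

A matching pair of `frontend`/`backend` stanzas on port 80 is needed as well.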

I'm going to destroy it and run it again to verify all the steps. Then I'll document my process.

Thanks for being there to help Vadim!

cgruver commented 4 years ago

If I patch it again later to use an iSCSI LUN, will that break anything?

Replacing the backend is supported; however, I doubt the data would migrate automagically.

OK. But, if I do that right after creating the cluster, there's no data in there that is going to be missed by the cluster, correct?

cgruver commented 4 years ago
Screen Shot 2020-03-07 at 12 45 46 PM
vrutkovs commented 4 years ago

Woot! Feel free to ping me on Slack if you need a live debug session :)

OK. But, if I do that right after creating the cluster, there's no data in there that is going to be missed by the cluster, correct?

Correct, the registry would be empty unless some builds have happened

cgruver commented 4 years ago

Will do.

I CAN'T tell you how excited I am to get this going. I'm going to get it set up for my team in our lab at work. We are running three clusters of OCP 3.11 on RHV 4.2 in our data center. This is our first step toward getting them ready for the transition to OCP 4.x, probably 4.5. I'm also really interested in getting IPI going with KubeVirt. I envision eliminating RHV from the environment and letting OCP manage the whole environment.

I'll close this issue, and post my documentation when I get it written up in a human friendly format. I'll put it in my GitHub account. We can cherry-pick from there, if there is anything useful.

cgruver commented 4 years ago

Cluster deployed successfully.

srinivasmmdl commented 1 year ago

Hello All,

I have a similar issue with openshift-installer 4.12.20, which references the 4.12.15 release image. Will it work if I export the OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE variable with the correct image name shown at the end of `oc adm release mirror`?