cgruver closed this issue 4 years ago.
Please attach log bundle for more info - https://docs.openshift.com/container-platform/4.3/installing/installing-gather-logs.html
@vrutkovs Thanks Vadim, I will run it again when I get home this evening and attempt to get the log bundle.
I seem to have made it worse... Now the bootstrap isn't even getting started. It is looping on:
Mar 04 12:02:55 okd4-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Mar 04 12:02:55 okd4-bootstrap bootkube.sh[7285]: Error: error getting image "registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f": unable to find 'registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f' in local storage: no such image
Mar 04 12:02:55 okd4-bootstrap bootkube.sh[7285]: Warning: Could not resolve release image to pull by digest
Mar 04 12:02:56 okd4-bootstrap bootkube.sh[7285]: Error: unable to pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f: Error reading manifest sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f in registry.svc.ci.openshift.org/origin/release: manifest unknown: manifest unknown
Mar 04 12:02:56 okd4-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Mar 04 12:02:56 okd4-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Mar 04 12:03:01 okd4-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 72.
Mar 04 12:03:01 okd4-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
I tried pulling the image that it is referring to on my laptop:
docker pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
Error response from daemon: manifest for registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f not found: manifest unknown: manifest unknown
As you can see, I'm seeing a similar error.
Since I was seeing similar errors last night with the previous attempt that I opened this issue on, I pulled a different OKD release. This one is using:
oc adm release extract --command='openshift-install' registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-03-170958
Is it possible that there is an issue with pulling registry.svc.ci.openshift.org/origin/release... images, or with the registry itself?
Debug logs attached:
FWIW, I tried this last night with several different OKD releases. Same behavior for all.
We keep releases for 48hrs only, so it might have been deleted?
Try https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html#installation-mirror-repository_installing-restricted-networks-preparations before your next deploy
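For reference, a sketch of the mirroring step that page describes, using the release tag from this thread. The local registry host is hypothetical; the invocation is printed rather than executed so it can be reviewed first:

```shell
# LOCAL_REGISTRY is a placeholder for a registry you control; the release tag
# is the one being installed in this thread.
LOCAL_REGISTRY="mirror.example.com:5000"
RELEASE_TAG="4.4.0-0.okd-2020-03-03-170958"
SRC="registry.svc.ci.openshift.org/origin/release:${RELEASE_TAG}"
# Print the mirror invocation rather than running it:
echo oc adm release mirror \
  --from="${SRC}" \
  --to="${LOCAL_REGISTRY}/origin/release" \
  --to-release-image="${LOCAL_REGISTRY}/origin/release:${RELEASE_TAG}"
```

Mirroring pins the release in a registry you control, so the 48-hour CI retention window stops mattering.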
@vrutkovs
Vadim, is it possible that something has gone pancake shaped with the release images in the last day or so? Here is what I am seeing:
I extracted the installer from: release:4.4.0-0.okd-2020-03-03-170958
which has a SHA of 507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738.
However, the install is trying to pull: release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
which appears to be the wrong image. As noted above, if I try to pull the same image on my laptop, it also fails.
docker pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
Error response from daemon: manifest for registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f not found: manifest unknown: manifest unknown
But, if I use the SHA that is correct:
docker pull registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738: Pulling from origin/release
34971b2d1eb9: Pull complete
4fbc3bafa3d4: Pull complete
b6b944cbc4e6: Pull complete
4cb248145d19: Pull complete
bf0437183dc9: Pull complete
7a34e867ac9e: Pull complete
Digest: sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
Status: Downloaded newer image for registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
It works!
$ oc adm release extract --command=openshift-install registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-03-170958
$ ./openshift-install version
./openshift-install 4.4.0-0.okd-2020-03-03-170958
built from commit b8170d82bf1034d197c33b7e3118a03416b1725d
release image registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
$ skopeo inspect docker://registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738 | grep "io.openshift.release"
"io.openshift.release": "4.4.0-0.okd-2020-03-03-170958",
Make sure you've cleaned openshift-install dir and don't have OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE set anywhere
Thanks.
Yes, I wipe the ignition configs and install directory with every attempt.
OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE is not set.
I am now attempting an install with:
export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
I'll update with the results.
No, don't do that.
Use a new clean dir - openshift-install --dir
Too late... ;-)
I did use a clean openshift-install --dir
I'm going to scrub the whole environment and try again from a fresh start.
Will update. Thanks for all your help.
Well, I am completely at a loss. I'm sure I've got something messed up here, but after cleaning the whole environment up, I see the same behavior.
It still looks like it is trying to pull the wrong release image.
I have verified that the bootstrap machine can pull images. So it does not look like a connectivity, routing, or DNS issue.
[root@osc-controller01 okd4-lab]# ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null core@okd4-bootstrap
Warning: Permanently added 'okd4-bootstrap,10.11.11.99' (ECDSA) to the list of known hosts.
This is the bootstrap node; it will be destroyed when the master is fully up.
The primary service is "bootkube.service". To watch its status, run e.g.
journalctl -b -f -u bootkube.service
Fedora CoreOS 31.20200210.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/
[core@okd4-bootstrap ~]$ sudo bash
[root@okd4-bootstrap core]# docker pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
Error response from daemon: manifest for registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f not found
[root@okd4-bootstrap core]# docker pull registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738: Pulling from origin/release
34971b2d1eb9: Pull complete
4fbc3bafa3d4: Pull complete
b6b944cbc4e6: Pull complete
4cb248145d19: Pull complete
bf0437183dc9: Pull complete
7a34e867ac9e: Pull complete
Digest: sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
Status: Downloaded newer image for registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
[root@okd4-bootstrap core]#
You see, if I manually try to pull the image that bootkube.service is trying to pull, I see the same behavior. If I pull release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738, it works.
I'm using openshift-install. Should I be using openshift-baremetal-install?
Anything else I should look at?
Logs attached.
It's not clear which release image is used by the installer, so I don't know which release you're installing; all I can see is that it's gone.
Try mirroring the latest available release to a safe location - https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-restricted-networks-preparations.html
This is the release that I believe it should be:
release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
I am using openshift-install extracted from 4.4.0-0.okd-2020-03-03-170958
which matches the hash above.
oc adm release extract --command=openshift-install registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-03-170958
[root@osc-controller01 okd4-lab]# openshift-install version
openshift-install 4.4.0-0.okd-2020-03-03-170958
built from commit b8170d82bf1034d197c33b7e3118a03416b1725d
release image registry.svc.ci.openshift.org/origin/release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
But the bootstrap node is clearly trying to pull registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
It is really strange... I will try your suggestion. I'm also moving to registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-04-125648
from yesterday.
I'll update on progress.
Thanks!
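Aside: that eyeball comparison of digests can be scripted. A sketch, with the installer's version output fabricated inline (in a real run you would capture `VERSION_OUTPUT=$(./openshift-install version)`):

```shell
# Compare the digest pinned in the installer binary against the digest of the
# release the binary was extracted from. VERSION_OUTPUT is fabricated here.
EXPECTED="sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738"
VERSION_OUTPUT="release image registry.svc.ci.openshift.org/origin/release@${EXPECTED}"
# Everything after the '@' on the "release image" line is the pinned digest:
PINNED=$(printf '%s\n' "${VERSION_OUTPUT}" | awk -F@ '/release image/ {print $2}')
if [ "${PINNED}" = "${EXPECTED}" ]; then
  echo "digest matches"
else
  echo "digest MISMATCH: ${PINNED}"
fi
```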
@vrutkovs
I found something!!!
This is openshift-install 4.4.0-0.okd-2020-03-03-170958
I did some surgery on the bootstrap.ign file created by openshift-install with the following install-config:
apiVersion: v1
baseDomain: my.cluster.domain
metadata:
name: okd4
networking:
networkType: OpenShiftSDN
clusterNetwork:
- cidr: 10.100.0.0/14
hostPrefix: 23
serviceNetwork:
- 172.30.0.0/16
machineNetwork:
- cidr: 10.11.11.0/24
compute:
- name: worker
replicas: 3
controlPlane:
name: master
replicas: 3
platform:
none: {}
pullSecret: '{"auths": {"quay.io": {"auth": "Y2dydXZlcjpVci9REDACTED", "email": ""}}}'
sshKey: ssh-rsa AAAAB3NzaC1ycREDACTED root@my-bastion-host
Below is the entry for /usr/local/bin/release-image-download.sh from the bootstrap.ign file (the base64-encoded contents.source payload, decoded).
It's got the wrong release image in it. I've extracted multiple versions of openshift-install from recent releases. They all seem to be putting this same release image into the ignition file.
#!/usr/bin/env bash
set -euo pipefail
# Download the release image. This script is executed as a oneshot
# service by systemd, because we cannot make use of Requires and a
# simple service: https://github.com/systemd/systemd/issues/1312.
#
# This script continues trying to download the release image until
# successful because we cannot use Restart=on-failure with a oneshot
# service: https://github.com/systemd/systemd/issues/2582.
#
RELEASE_IMAGE=registry.svc.ci.openshift.org/origin/release@sha256:24a8dcc1dd1508d8722fe321b2eb5a21a63adfecc16bb1b0c2c9e6d7f5cea11f
echo "Pulling $RELEASE_IMAGE..."
while ! podman pull --quiet "$RELEASE_IMAGE"
do
echo "Pull failed. Retrying $RELEASE_IMAGE..."
done
It's got this rogue release image coded into it instead of release@sha256:507d600e377489c4cce6bc2f34d4176f4bac005298e9e86865010a7c45546738
I am going to try your suggestion of creating a local mirror. If that overrides the source image, then it will probably get me around this issue, but is it possible that recent releases of the OKD image have the wrong release hardcoded into them?
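That decode can be scripted instead of done by hand. A sketch, assuming jq is available; a tiny fabricated bootstrap.ign stands in for the real one so the extraction logic itself can be demonstrated:

```shell
# Fabricate a minimal ignition fragment for demonstration; the real file is
# what openshift-install writes to the install dir.
mkdir -p /tmp/ign-demo
PAYLOAD=$(printf 'RELEASE_IMAGE=registry.example/origin/release@sha256:abc123' | base64 | tr -d '\n')
cat > /tmp/ign-demo/bootstrap.ign <<EOF
{"storage":{"files":[{"path":"/usr/local/bin/release-image-download.sh","contents":{"source":"data:text/plain;charset=utf-8;base64,${PAYLOAD}"}}]}}
EOF
# Pull out the pinned release image without hand-decoding anything:
jq -r '.storage.files[]
       | select(.path == "/usr/local/bin/release-image-download.sh")
       | .contents.source' /tmp/ign-demo/bootstrap.ign \
  | sed 's|^data:text/plain;charset=utf-8;base64,||' \
  | base64 -d \
  | grep '^RELEASE_IMAGE='
```

Run against a real bootstrap.ign, this prints the exact image the bootstrap node will try to pull, which is how the mismatch above could have been caught before booting anything.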
It's got the wrong release image in it.
Could you attach .openshift_install.log from the install dir? It might have clues as to why openshift-install changed the release image.
It says right at the top: OpenShift Installer 4.4.0-0.okd-2020-02-28-211838!!!
Yet I get this:
[root@osc-controller01 okd4-install]# openshift-install version
openshift-install 4.4.0-0.okd-2020-03-04-125648
built from commit b8170d82bf1034d197c33b7e3118a03416b1725d
release image registry.svc.ci.openshift.org/origin/release@sha256:07d8cc840611e1b80d37d1f27c90c31758f2b6208930e20e75507b14d89a3266
Twilight Zone dude!
Wait... false alarm. That's got a lot of previous builds in it.
Let me create a fresh one.
Fresh log. I also feel like a bit of an idiot...
When I've been cleaning up after an install, I've been removing the contents of okd4-install. However, I completely forgot about .openshift_install_state.json.
I bet the stale .openshift_install_state.json file is poisoning my installation attempts!
I'm running a new install. Will update with results.
I bet the stale .openshift_install_state.json file is poisoning my installation attempts!
Yes :) I usually just rm -rf clusters/foo && openshift-install --dir clusters/foo
just to be sure
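That routine can be scripted so the wipe is never forgotten. A minimal sketch with hypothetical paths, simulating the stale-state problem locally:

```shell
# Always recreate the install dir from scratch so no stale
# .openshift_install_state.json can poison the next run.
CLUSTER_DIR=/tmp/clusters/okd4-demo
mkdir -p "${CLUSTER_DIR}"
touch "${CLUSTER_DIR}/.openshift_install_state.json"   # leftover from a "previous run"
rm -rf "${CLUSTER_DIR}" && mkdir -p "${CLUSTER_DIR}"   # the wipe suggested above
printf 'apiVersion: v1\n' > "${CLUSTER_DIR}/install-config.yaml"  # stand-in for the real config
# openshift-install create ignition-configs --dir="${CLUSTER_DIR}"  # the real step
ls -A "${CLUSTER_DIR}"
```

After the wipe, only the freshly copied install-config.yaml remains; the hidden state file is gone.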
Yes... important safety tip.
I have modified my procedures accordingly. It is putting the correct image into the ignition file now.
#!/usr/bin/env bash
set -euo pipefail
# Download the release image. This script is executed as a oneshot
# service by systemd, because we cannot make use of Requires and a
# simple service: https://github.com/systemd/systemd/issues/1312.
#
# This script continues trying to download the release image until
# successful because we cannot use Restart=on-failure with a oneshot
# service: https://github.com/systemd/systemd/issues/2582.
#
RELEASE_IMAGE=registry.svc.ci.openshift.org/origin/release@sha256:07d8cc840611e1b80d37d1f27c90c31758f2b6208930e20e75507b14d89a3266
echo "Pulling $RELEASE_IMAGE..."
while ! podman pull --quiet "$RELEASE_IMAGE"
do
echo "Pull failed. Retrying $RELEASE_IMAGE..."
done
Now, back to testing the install. I'll upload the logs if the original issue still occurs.
Progress!
Now it is looping on:
Mar 05 17:46:20 okd4-bootstrap bootkube.sh[19998]: [#3033] failed to create some manifests:
Mar 05 17:46:20 okd4-bootstrap bootkube.sh[19998]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in version "machineconfiguration.openshift.io/v1"
I'm used to seeing it stuck there for a few minutes while pods are starting, but it's been looping on this for nearly an hour now.
logs attached.
That error means "machine-config operator didn't start yet and I don't know how to handle MachineConfig objects". Does it crash after 30 mins?
Yes. It crashes and the bootstrap control plane restarts:
Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Error: error while checking pod status: timed out waiting for the condition
Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Tearing down temporary bootstrap control plane...
Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Error: error while checking pod status: timed out waiting for the condition
Mar 05 16:53:47 okd4-bootstrap podman[6547]: 2020-03-05 16:53:47.126982148 +0000 UTC m=+1200.565302594 container died cf7ebaaba61212fb59d50abbb7124e431d3f682c8b4bd0a9d4783ab5a44dc44a (image=registry.svc.ci.openshift.org/origin/4.4-2020-03-04-125648@sha256:e80f6486d558776d8b331d7856f5ba3bbaff476764b395e82d279ce86c6bb11d, name=zen_maxwell)
Mar 05 16:53:47 okd4-bootstrap podman[6547]: 2020-03-05 16:53:47.257636685 +0000 UTC m=+1200.695957098 container remove cf7ebaaba61212fb59d50abbb7124e431d3f682c8b4bd0a9d4783ab5a44dc44a (image=registry.svc.ci.openshift.org/origin/4.4-2020-03-04-125648@sha256:e80f6486d558776d8b331d7856f5ba3bbaff476764b395e82d279ce86c6bb11d, name=zen_maxwell)
Mar 05 16:53:47 okd4-bootstrap bootkube.sh[749]: Error: Failed to evict container: "": Failed to find container "etcd-signer" in state: no container with name or ID etcd-signer found: no such container
Mar 05 16:53:47 okd4-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Mar 05 16:53:47 okd4-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Mar 05 16:53:47 okd4-bootstrap systemd[1]: bootkube.service: Consumed 1min 604ms CPU time.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: bootkube.service: Consumed 1min 604ms CPU time.
Mar 05 16:53:52 okd4-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Should I be using openshift-baremetal-install?
oc adm release extract --command='openshift-baremetal-install' registry.svc.ci.openshift.org/origin/release:4.4.0-0.okd-2020-03-04-125648
Same behavior with openshift-baremetal-install.
Tried it just for fun...
So the disappearing release image issue has been resolved, right?
Notes on findings:
Masters didn't join the cluster. Charro, can you ssh to the masters and fetch kubelet logs from them?
kube-apiserver:
log.go:172] http: TLS handshake error from 10.11.11.100:56312: tls: client offered only unsupported versions: [300]
Which node is 10.11.11.100? It appears 10.11.11.99 is the bootstrap and 10.11.11.101 is one of the masters?
Yes. The release image issue was self-inflicted... problem between keyboard and seat.
Masters did not get to a point where SSH was available. They were waiting for port 22623 to become available on the bootstrap, which never did because of the machine config issue.
10.11.11.100 is the router, (HAProxy). 10.11.11.101-103 are the master nodes. 10.11.11.99 is the bootstrap.
I'll try again in the morning with any suggestions that you have.
I believe that my router was not properly forwarding DNS for my cluster domain. The bootstrap and master nodes were unable to resolve api-int.okd4.oscluster.clgcom.org to get to: https://api-int.okd4.oscluster.clgcom.org:22623/config/master
I have modified the router config to properly forward DNS.
I believe that my router was not properly forwarding DNS for my cluster domain. The bootstrap and master nodes were unable to resolve api-int.okd4.oscluster.clgcom.org to get to: https://api-int.okd4.oscluster.clgcom.org:22623/config/master
I have modified the router config to properly forward DNS.
Ah yep, that will cause issues. Make sure the other DNS requirements are configured properly while you're at it (etcd SRV/CNAME records, etc.)
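For completeness, a hedged sketch of those records as zone-file entries, using the names and addresses quoted in this thread (the LB at 10.11.11.100, masters at .101-.103; the etcd-N hostnames and SRV weights follow the OKD 4.4 UPI conventions and are illustrative):

```
; zone oscluster.clgcom.org -- records the installer expects for cluster okd4
api.okd4                    IN A    10.11.11.100
api-int.okd4                IN A    10.11.11.100
*.apps.okd4                 IN A    10.11.11.100
etcd-0.okd4                 IN A    10.11.11.101
etcd-1.okd4                 IN A    10.11.11.102
etcd-2.okd4                 IN A    10.11.11.103
_etcd-server-ssl._tcp.okd4  IN SRV  0 10 2380 etcd-0.okd4
_etcd-server-ssl._tcp.okd4  IN SRV  0 10 2380 etcd-1.okd4
_etcd-server-ssl._tcp.okd4  IN SRV  0 10 2380 etcd-2.okd4
```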
Success! Or, at least progress.
It looks like the bootstrap succeeded:
Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler RunningNotReady
Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-kube-controller-manager/kube-controller-manager Ready
Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
Mar 07 16:26:21 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-kube-apiserver/kube-apiserver Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-kube-apiserver/kube-apiserver Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-kube-scheduler/openshift-kube-scheduler Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-kube-controller-manager/kube-controller-manager Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: Pod Status:openshift-cluster-version/cluster-version-operator Ready
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: All self-hosted control plane components successfully started
Mar 07 16:26:26 okd4-bootstrap bootkube.sh[746]: Sending bootstrap-success event.Waiting for remaining assets to be created.
However, before completion, it did throw some errors. Are these significant?
Mar 07 16:29:30 okd4-bootstrap bootkube.sh[746]: E0307 16:29:30.853121 1 reflector.go:153] k8s.io/client-go@v0.17.1/tools/cache/reflector.go:105: Failed to list *v1.Etcd: Get https://api-int.okd4.oscluster.clgcom.org:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: stream error: stream ID 123; INTERNAL_ERROR
Mar 07 16:29:31 okd4-bootstrap bootkube.sh[746]: E0307 16:29:31.965890 1 reflector.go:153] k8s.io/client-go@v0.17.1/tools/cache/reflector.go:105: Failed to list *v1.Etcd: Get https://api-int.okd4.oscluster.clgcom.org:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: stream error: stream ID 125; INTERNAL_ERROR
Mar 07 16:29:33 okd4-bootstrap bootkube.sh[746]: I0307 16:29:33.007587 1 waitforceo.go:64] Cluster etcd operator bootstrapped successfully
Mar 07 16:29:33 okd4-bootstrap bootkube.sh[746]: I0307 16:29:33.014396 1 waitforceo.go:58] cluster-etcd-operator bootstrap etcd
Mar 07 16:29:33 okd4-bootstrap podman[9657]: 2020-03-07 16:29:33.062319447 +0000 UTC m=+69.225380802 container died cd0752e017463b63b1321143632b110b98f65252a79ce8611c1fca3227f33dc0 (image=registry.svc.ci.openshift.org/origin/4.4-2020-03-07-110054@sha256:2b48596c20d8a4d8afa2544d52088c6b7b668281412cbc59639dcd8837f3b4ad, name=hungry_agnesi)
Mar 07 16:29:33 okd4-bootstrap bootkube.sh[746]: bootkube.service complete
Mar 07 16:29:33 okd4-bootstrap systemd[1]: bootkube.service: Succeeded.
Mar 07 16:29:33 okd4-bootstrap systemd[1]: bootkube.service: Consumed 42.596s CPU time.
I'm tearing down the bootstrap now to see how the cluster stands alone.
The issue yesterday with getting past MachineConfig was network bandwidth. The guest wireless network that I had my portable lab on just did not have enough bandwidth for the bootstrap to complete before timing out at 30 minutes.
However, before completion, it did throw some errors. Are these significant?
That's expected - this happens during the switch to the persistent control plane.
Did the install complete? Any docs required / small issues to fix in the process?
This install is progressing.
I approved all the CSRs and now the three workers are joining the cluster.
INFO Waiting up to 30m0s for the cluster at https://api.okd4.oscluster.clgcom.org:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.okd4.oscluster.clgcom.org on 172.30.0.10:53: no such host
INFO Cluster operator authentication Progressing is Unknown with NoData:
INFO Cluster operator authentication Available is Unknown with NoData:
INFO Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.4.0-0.okd-2020-03-07-110054
INFO Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment
INFO Cluster operator insights Disabled is True with Disabled: Health reporting is disabled
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 7
INFO Cluster operator support Disabled is True with : Health reporting is disabled
FATAL failed to initialize the cluster: Working towards 4.4.0-0.okd-2020-03-07-110054: 100% complete, waiting on authentication, console
I forgot to approve the CSRs right away, so I ran openshift-install --dir=okd4-install wait-for install-complete again to continue monitoring.
I'm going to document my whole setup and install process. We can cherry-pick from that for the official docs. The vbmc and iPXE config might be useful to others.
Will Update shortly.
Current status:
INFO Waiting up to 30m0s for the cluster at https://api.okd4.oscluster.clgcom.org:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console
@vrutkovs
Question for you: I used ephemeral storage for the registry.
oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed","storage":{"emptyDir":{}}}}'
If I patch it again later to use an iSCSI LUN, will that break anything?
If I patch it again later to use an iSCSI LUN, will that break anything?
Replacing the backend is supported, however I doubt that data would migrate automagically
It's ALIVE!!!
DEBUG OpenShift Installer 4.4.0-0.okd-2020-03-07-110054
DEBUG Built from commit f0d3afed3c4655a6514fdfc54bc40348f0aac80b
INFO Waiting up to 30m0s for the cluster at https://api.okd4.oscluster.clgcom.org:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console
E0307 12:37:18.657292 5569 reflector.go:280] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: Get https://api.okd4.oscluster.clgcom.org:6443/apis/config.openshift.io/v1/clusterversions?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dversion&resourceVersion=36180&timeoutSeconds=338&watch=true: dial tcp 10.11.11.100:6443: connect: connection refused
DEBUG Cluster is initialized
INFO Waiting up to 10m0s for the openshift-console route to be created...
DEBUG Route found in openshift-console namespace: console
DEBUG Route found in openshift-console namespace: downloads
DEBUG OpenShift console route is created
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/root/okd4-lab/okd4-install/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.okd4.oscluster.clgcom.org
INFO Login to the console with user: kubeadmin, password: nNoaw-8YrfZ-5Bddf-FwUoa
The last thing that I missed was that the Ingress Routers run on worker nodes. HAProxy was not directing 443 and 80 to the worker nodes; it was directing them to the masters.
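For anyone following along, the fix amounts to pointing the 80/443 frontends at the workers. A sketch of the relevant haproxy.cfg sections; the worker IPs are hypothetical, since only the masters' addresses (10.11.11.101-103) were given above:

```
frontend ingress-http
    bind *:80
    mode tcp
    default_backend okd4-ingress-http
frontend ingress-https
    bind *:443
    mode tcp
    default_backend okd4-ingress-https
backend okd4-ingress-http
    mode tcp
    balance source
    server okd4-worker-0 10.11.11.111:80 check    # hypothetical worker IPs
    server okd4-worker-1 10.11.11.112:80 check
    server okd4-worker-2 10.11.11.113:80 check
backend okd4-ingress-https
    mode tcp
    balance source
    server okd4-worker-0 10.11.11.111:443 check
    server okd4-worker-1 10.11.11.112:443 check
    server okd4-worker-2 10.11.11.113:443 check
```

Mode tcp passthrough keeps TLS termination on the ingress routers themselves, which is what the default router setup expects.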
I'm going to destroy it and run it again to verify all the steps. Then I'll document my process.
Thanks for being there to help Vadim!
If I patch it again later to use an iSCSI LUN, will that break anything?
Replacing the backend is supported, however I doubt that data would migrate automagically
OK. But, if I do that right after creating the cluster, there's no data in there that is going to be missed by the cluster, correct?
Woot! Feel free to ping me on Slack if you need a live debug session :)
OK. But, if I do that right after creating the cluster, there's no data in there that is going to be missed by the cluster, correct?
Correct, the registry would be empty unless some builds have happened
Will do.
I CAN'T tell you how excited I am to get this going. I'm going to get it set up for my team in our lab at work. We are running three OCP 3.11 clusters on RHV 4.2 in our datacenter. This is our first step in getting them ready for the transition to OCP 4.x, probably 4.5. I'm also really interested in getting IPI going with KubeVirt. I envision eliminating RHV from the environment and letting OCP manage the whole environment.
I'll close this issue, and post my documentation when I get it written up in a human friendly format. I'll put it in my GitHub account. We can cherry-pick from there, if there is anything useful.
Cluster deployed successfully.
Hello All,
I have a similar issue with openshift-installer 4.12.20, which has a reference to the 4.12.15 release image. Will it work if I export the OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE variable with the correct image name that is shown at the end of oc adm release mirror?
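For future readers: that environment variable is the mechanism for pointing the installer at a mirrored release, though note the caveat earlier in this thread about combining it with a stale install dir. A sketch with placeholder values; substitute the pinned digest printed at the end of your oc adm release mirror run:

```shell
# Registry host and digest below are placeholders, not real values.
export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="mirror.example.com:5000/okd/release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
echo "${OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE}"
```

Always set it in the same shell that runs openshift-install, against a freshly created --dir.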
Hey everyone,
I am attempting an OKD 4.4 UPI install using pre-allocated libvirt guests, iPXE, and virtual BMC.
I followed guides in the following documents:
https://github.com/openshift/okd/blob/master/Documentation/UPI/libvirt/libvirt.md
https://github.com/openshift/installer/blob/master/docs/user/metal/install_upi.md
The guests are initially booting via iPXE, getting reserved IPs with the appropriate DNS A and PTR records, and loading the appropriate ignition configs based on their MAC address.
I have an haproxy server set up for load balancing as described in the UPI docs.
Guests have 20GB of RAM, 4 vCPU, and 100GB /dev/sda
OKD Release: 4.4.0-0.okd-2020-02-28-211838
FCOS Release: fedora-coreos-31.20200210.3.0
Install Config:
FCOS is booting just fine. The ignition configs load on bootstrap, master, and worker nodes, and the bootstrap node begins to install the cluster.
Eventually the bootstrap bootkube.service logs begin looping on the following:
Any suggestions? Googling these errors didn't turn up any hits, so I suspect that I am doing something wrong, or I'm hitting an issue that isn't well known yet.