openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0
Failed IPI install on GCP #2747

Closed Jamstah closed 4 years ago

Jamstah commented 4 years ago

Version

$ openshift-install version
./openshift-install v4.2.8
built from commit 425e4ff0037487e32571258640b39f56d5ee5572
release image quay.io/openshift-release-dev/ocp-release@sha256:4bf307b98beba4d42da3316464013eac120c6e5a398646863ef92b0e2c621230

Platform:

GCP

What happened?

The installer fails to initialise the operators:

time="2019-12-04T09:18:30-05:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"
time="2019-12-04T09:21:45-05:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.8: 99% complete"
time="2019-12-04T09:22:59-05:00" level=fatal msg="failed to initialize the cluster: Working towards 4.2.8: 99% complete"

What you expected to happen?

Successful install

How to reproduce it (as minimally and precisely as possible)?

Run the install with no changes to the install config (the default 3 masters and 3 workers).

Anything else we need to know?

I did some debugging. The installer seems to have created the machines:

[jammy@ibm007470 Downloads]$ oc --config kubeconfig get machine -n openshift-machine-api
NAME                   STATE     TYPE            REGION        ZONE            AGE
cp4i-xp2kl-m-0         RUNNING   n1-standard-4   us-central1   us-central1-a   90m
cp4i-xp2kl-m-1         RUNNING   n1-standard-4   us-central1   us-central1-b   90m
cp4i-xp2kl-m-2         RUNNING   n1-standard-4   us-central1   us-central1-c   90m
cp4i-xp2kl-w-a-lxr6t   RUNNING   n1-standard-4   us-central1   us-central1-a   89m
cp4i-xp2kl-w-b-5dldc   RUNNING   n1-standard-4   us-central1   us-central1-b   89m
cp4i-xp2kl-w-c-7bbvv   RUNNING   n1-standard-4   us-central1   us-central1-c   89m

But the worker nodes don't appear; only the masters are listed:

[jammy@ibm007470 Downloads]$ oc --config kubeconfig get nodes
NAME                                                         STATUS   ROLES    AGE   VERSION
cp4i-xp2kl-m-0.us-central1-a.c.starship-techsales.internal   Ready    master   90m   v1.14.6+6ac6aa4b0
cp4i-xp2kl-m-1.us-central1-b.c.starship-techsales.internal   Ready    master   91m   v1.14.6+6ac6aa4b0
cp4i-xp2kl-m-2.us-central1-c.c.starship-techsales.internal   Ready    master   91m   v1.14.6+6ac6aa4b0

nodelink-controller obviously doesn't find the node to link:

[jammy@ibm007470 Downloads]$ oc --config kubeconfig logs machine-api-controllers-59794c996-7wbjn -n openshift-machine-api -c nodelink-controller | grep xp2kl-w-a
I1204 13:42:58.864878       1 nodelink_controller.go:334] Finding node from machine "cp4i-xp2kl-w-a-lxr6t"
I1204 13:42:58.865693       1 nodelink_controller.go:351] Finding node from machine "cp4i-xp2kl-w-a-lxr6t" by providerID
W1204 13:42:58.865775       1 nodelink_controller.go:353] Machine "cp4i-xp2kl-w-a-lxr6t" has no providerID
I1204 13:42:58.865819       1 nodelink_controller.go:375] Finding node from machine "cp4i-xp2kl-w-a-lxr6t" by IP
W1204 13:42:58.865979       1 nodelink_controller.go:386] not found internal IP for machine "cp4i-xp2kl-w-a-lxr6t"
I1204 13:42:58.866019       1 nodelink_controller.go:328] No-op: Node for machine "cp4i-xp2kl-w-a-lxr6t" not found

I don't know what is supposed to create the node, but happy to dig out more logs. I'm assuming the operators don't initialise because there are no worker nodes to run them on.
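
The comparison between the two listings above can be automated. This is a minimal sketch, not part of the original report: the function name machines_without_nodes and the file names are hypothetical, and it assumes that a node's name starts with the machine name followed by a dot, as it does in the GCP output above.

```shell
# Print every machine that has no matching node.
# Assumption: the node name is "<machine-name>.<zone>...", as in the listings above.
# $1: file holding `oc get machine` output
# $2: file holding `oc get nodes` output
machines_without_nodes() {
    awk -v nodes="$2" '
        BEGIN {
            # Load the node table, skipping its header row.
            while ((getline line < nodes) > 0) {
                n++
                if (n == 1) continue
                # Keep only the text before the first dot or space.
                split(line, f, /[ \t.]+/)
                seen[f[1]] = 1
            }
        }
        # Machine table: skip the header and blank lines, report unmatched names.
        FNR > 1 && NF && !($1 in seen) { print $1 }
    ' "$1"
}
```

To use it, save the output of the two oc commands shown above to machines.txt and nodes.txt, then run machines_without_nodes machines.txt nodes.txt. With the listings in this report it would print the three worker machines.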

abhinavdahiya commented 4 years ago

Can you run oc adm must-gather and provide the bundle it captures?

The important logs I'm interested in are from the deployment openshift-cluster-machine-approver/cluster-machine-approver.
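
For context, the machine approver is the component that approves the certificate signing requests (CSRs) that new nodes submit when they join the cluster, so a node that never appears may have a request stuck in Pending. A minimal sketch of that check follows; the helper name pending_csrs is hypothetical, and it assumes the condition is the last column of the oc get csr table.

```shell
# Print the names of CSRs whose condition is exactly "Pending".
# Assumption: the condition is the last column of `oc get csr` output.
# Reads the `oc get csr` table on stdin.
pending_csrs() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}
```

It could be run as oc --config kubeconfig get csr | pending_csrs. A pending request can be approved by hand with oc adm certificate approve followed by the CSR name, but only after checking that the request really comes from one of the expected machines.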

jconallen commented 4 years ago

Here are the logs from the pod machine-approver-579bd55c89-mhgh4. I ran the must-gather, but it will take a little while before I can verify there is no PI in there.

logs.txt

abhinavdahiya commented 4 years ago

I didn't find anything suspicious in the machine-approver logs. The must-gather logs should provide more details...

You can also open a Bugzilla using https://bugzilla.redhat.com/enter_bug.cgi?product=OpenShift%20Container%20Platform . That has the benefit that we can easily request help from various operator teams, and you can also upload somewhat private information (only visible to the Red Hat group).

There is already a bug similar to yours: https://bugzilla.redhat.com/show_bug.cgi?id=1779866

jconallen commented 4 years ago

Yes, that bug is exactly what we are seeing. I still have to create an account to comment on that bug, but I will use it to upload the must-gather data if requested.

abhinavdahiya commented 4 years ago

Closing in favor of the BZ.

openshift-ci-robot commented 4 years ago

@abhinavdahiya: Closing this issue.
