ocp-power-automation / ocp4-upi-powervs

OpenShift on Power Virtual Server
Apache License 2.0

Could not deploy OCP in lon06 zone #143

Closed gitsridhar closed 3 years ago

gitsridhar commented 3 years ago

I could not deploy OCP 4.6 in the lon06 zone. All nodes except the bastion are stuck in a reboot loop, failing to download their ignition (.ign) files.

bpradipt commented 3 years ago

@gitsridhar have you used the right RHCOS image with the matching code base (e.g. RHCOS 4.5 with the release-4.5 branch)?
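One way a branch/image mismatch shows up is in the Ignition spec version: OCP 4.6 installers generate Ignition spec v3 configs, which RHCOS 4.5 and earlier images (spec v2) cannot parse. A minimal sketch, assuming the generated `.ign` file is available locally (file name is an example), to check what a config declares:

```python
import json

def ignition_spec_version(path):
    """Return the Ignition spec version declared in an .ign file."""
    with open(path) as f:
        return json.load(f)["ignition"]["version"]

# A spec declared as 3.x needs an RHCOS 4.6+ image;
# 2.x configs are for RHCOS 4.5 and earlier.
```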

gitsridhar commented 3 years ago

@bpradipt I need 4.6, so I used the release-4.6 branch of the repo with a newly uploaded RHCOS 4.6 image (rhcos-4.6.1-ppc64le).

lsmcfadden commented 3 years ago

What is the next step here?

yussufsh commented 3 years ago

We need the console messages (bootstrap node to start with) to find out why the nodes are not picking up the ign files. It could be a DHCP issue, an ignition version mismatch, a disk failure, etc.
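One of these candidates can be ruled out from the bastion itself: in this UPI flow the bastion serves the ignition files over HTTP, so a quick probe shows whether the files are reachable at all. A sketch, assuming the bastion's HTTP endpoint and the `/ignition/bootstrap.ign` path (both are illustrative, not confirmed from this thread):

```python
import urllib.request

def probe_ignition(url, timeout=5):
    """Return the HTTP status code for an ignition URL, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None

# Example (host/port/path are assumptions for illustration):
# probe_ignition("http://192.168.25.3:8080/ignition/bootstrap.ign")
```

A `None` here points at the HTTP server or network path rather than the nodes themselves.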

gitsridhar commented 3 years ago

Yussuf, in lon06 all nodes (bootstrap, master, and worker) are in a loop. The bootstrap node reboots in a loop, and on the bootstrap as well as the master/worker nodes I see this message: 'A start job is running for Ignition (fetch-offline)'. This goes on for 5 minutes, followed by another 5 minutes of silence, followed by a reboot.

This is the only cluster in lon06, can you access it and look?

yussufsh commented 3 years ago

> This is the only cluster in lon06, can you access it and look?

I do not have access to your resource group. Please check the console for error messages indicating why the node is going into an emergency reboot.

gitsridhar commented 3 years ago

Yussuf, I have access to the console of these nodes from the web interface; how else can I check the error messages? I copy-pasted above what I could from the web console of the failing nodes.

yussufsh commented 3 years ago

Can you please paste a screenshot of the tailing console messages? The part just before it says it is going into emergency mode. If that shows nothing, we need to get on a WebEx and check the errors.

yussufsh commented 3 years ago

The issue here was that the node was stuck at the GRUB prompt. A reboot usually works; if the config drive is not read properly, deleting the node and creating it again will help.
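Since this repo drives the deployment with Terraform, one way to delete and recreate a single node without tearing down the cluster is `terraform taint` followed by an apply. A sketch only: the resource address below is illustrative and must be replaced with the actual address from `terraform state list`; the var file name follows this repo's convention.

```shell
# Find the exact resource address of the failing node first:
terraform state list | grep pi_instance

# Mark it for recreation (address below is a hypothetical example):
terraform taint 'module.nodes.ibm_pi_instance.bootstrap[0]'

# Recreate it on the next apply:
terraform apply -var-file=var.tfvars
```

This forces a fresh instance, which also gives the config drive a clean chance to be attached and read.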