Closed: fretboarder closed this issue 6 years ago
It seems to be the same issue I have -> https://github.com/openshift/origin/issues/18115
The problem is that after deleting the nodes, the SECOND RESTART fails consistently.
Are you seeing the error mentioned in https://github.com/openshift/origin/issues/18115#issue-288679927 when the node fails to start the second time?
F0115 16:03:31.681848 122755 network.go:45] SDN node startup failed: failed to get subnet for this host: openshift-tst-master-1, error: timed out waiting for the condition
Exactly.
Jan 17 07:16:58 oo-node-1 origin-node[30642]: F0117 07:16:58.935014 30642 network.go:45] SDN node startup failed: failed to get subnet for this host: oo-node-1, error: timed out waiting for the condition
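For anyone hitting the same timeout: one way to check whether the node's SDN subnet record was removed along with the node object is to compare the registered nodes against the SDN host subnets (illustrative commands, assuming cluster-admin access on a master):

```sh
# List the SDN host subnets known to the master; a healthy node
# should appear here with its allocated subnet.
oc get hostsubnets

# Compare with the registered node objects; a node missing from
# hostsubnets but present (or re-registering) here would match the
# "failed to get subnet for this host" timeout.
oc get nodes
```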
While doing some more investigation in the meantime, I came across https://github.com/Microsoft/openshift-container-platform/blob/master/scripts/deployOpenShift.sh. This is the installation script for OCP (not Origin!).
Lines 659-660:
runuser -l $SUDOUSER -c "ansible-playbook ~/reboot-master.yml"
runuser -l $SUDOUSER -c "ansible-playbook ~/reboot-nodes.yml"
In fact, rebooting the nodes seems to be a workaround for the moment.
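For reference, a minimal playbook along the lines of what the Microsoft script invokes might look like this. This is a sketch, not taken from the script: the `nodes` host group name is an assumption, and the `reboot` module requires Ansible >= 2.7 (older setups typically use `shell: reboot` plus `wait_for_connection`).

```yaml
# reboot-nodes.yml - sketch of a rolling node reboot (hypothetical)
- hosts: nodes        # assumed inventory group name
  serial: 1           # reboot one node at a time
  become: yes
  tasks:
    - name: Reboot the node and wait for it to come back
      reboot:
        reboot_timeout: 600
```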
Ok, I'm going to close this issue then as a dup of #18115.
When configuring an OpenShift Origin 3.7 cluster for Azure, the origin-node processes no longer come up. The initial installation of the OpenShift cluster works perfectly, but after re-configuring it according to https://docs.openshift.org/latest/install_config/configuring_azure.html the cluster is broken.
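For context, the re-configuration in question enables the Azure cloud provider on each node. The relevant node-config.yaml excerpt looks roughly like this (key names follow the linked docs; the /etc/azure/azure.conf path is the conventional location and may differ per setup):

```yaml
# node-config.yaml excerpt - Azure cloud provider settings
kubeletArguments:
  cloud-provider:
    - "azure"
  cloud-config:
    - "/etc/azure/azure.conf"
```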
Version
Steps To Reproduce
This means:
The problem is that after deleting the nodes, the SECOND RESTART fails consistently.
The FIRST RESTART seems to work (probably because at that point the node has not been deleted yet?), and after quite a long time (>60s) the node seems to recover automatically. But in this case the networking inside the cluster seems to be broken as well; e.g. access to the internal docker registry fails with:
Probably that's a separate issue.
Again, it is important to note that everything works well until the Azure re-configuration.
Current Result
After restarting, the origin-node processes fail to come up.
Expected Result
After restarting, the origin-node processes come up, are added to the cluster, and the cluster works again.
Additional Information
Log output of the failing origin-node process
Interestingly, as soon as the Azure configuration is removed from node-config.yaml the origin-node process starts flawlessly:
The only differences I could find in the two cases are:
To me it seems that there's something wrong with the network re-configuration in the cloud provider configuration case.
For completeness, here's my inventory file