microsoft / openshift-container-platform

OpenShift Container Platform on Azure
MIT License

OpenShift deploy fails during "Rebooting cluster to complete installation" #59

Closed mavis1827 closed 6 years ago

mavis1827 commented 6 years ago

I'm working on building a POC of OpenShift (OCP) on Azure for one of our customers. I ran into multiple issues (#48, #51 and #53) and bypassed them by following the suggested workarounds and with the help of @dwaiba. Now I run into the issue below:

New-AzureRmResourceGroupDeployment : 5:30:59 PM - Resource Microsoft.Resources/deployments 'OpenShiftDeployment' failed with message '{ "status": "Failed", "error": { "code": "ResourceDeploymentFailure", "message": "The resource operation completed with terminal provisioning state 'Failed'.", "details": [ { "code": "DeploymentFailed", "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.", "details": [ { "code": "Conflict", "message": "{\r\n "status": "Failed",\r\n "error": {\r\n "code": "ResourceDeploymentFailure",\r\n "message": "The resource operation completed with terminal provisioning state 'Failed'.",\r\n "details": [\r\n {\r\n "code": "VMExtensionProvisioningError",\r\n "message": "VM has reported a failure when processing extension 'deployOpenShift'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=1\n[stdout]\nonf']})\nchanged: [ocpcluster-master-2] => (item={u'key': u'kubeletArguments.cloud-provider', u'value': [u'azure']})\n\nRUNNING HANDLER [restart atomic-openshift-node] ****\nchanged: [ocpcluster-master-2]\n\nPLAY RECAP *****\nocpcluster-master-0 : ok=4 changed=2 unreachable=0 failed=0 \nocpcluster-master-1 : ok=4 changed=2 unreachable=0 failed=0 \nocpcluster-master-2 : ok=4 changed=2 unreachable=0 failed=0 \n\nThu Mar 22 00:27:10 UTC 2018 - Cloud Provider setup of node config on Master Nodes completed successfully\nThu Mar 22 00:27:10 UTC 2018 - Sleep for 60\n\nPLAY [nodes:!masters] **\n\nTASK [make sure /etc/azure exists] *\nchanged: [ocpcluster-infra-0]\n\nTASK [populate /etc/azure/azure.conf] **\nchanged: [ocpcluster-infra-0]\n\nTASK [insert the azure disk config into the node] **\nchanged: [ocpcluster-infra-0] => (item={u'key': u'kubeletArguments.cloud-config', u'value': [u'/etc/azure/azure.conf']})\nchanged: [ocpcluster-infra-0] => (item={u'key': u'kubeletArguments.cloud-provider', 
u'value': [u'azure']})\n\nRUNNING HANDLER [restart atomic-openshift-node] ****\nchanged: [ocpcluster-infra-0]\n\nPLAY [nodes:!masters] **\n\nTASK [make sure /etc/azure exists] *\nchanged: [ocpcluster-infra-1]\n\nTASK [populate /etc/azure/azure.conf] **\nchanged: [ocpcluster-infra-1]\n\nTASK [insert the azure disk config into the node] **\nchanged: [ocpcluster-infra-1] => (item={u'key': u'kubeletArguments.cloud-config', u'value': [u'/etc/azure/azure.conf']})\nchanged: [ocpcluster-infra-1] => (item={u'key': u'kubeletArguments.cloud-provider', u'value': [u'azure']})\n\nRUNNING HANDLER [restart atomic-openshift-node] ****\nchanged: [ocpcluster-infra-1]\n\nPLAY [nodes:!masters] **\n\nTASK [make sure /etc/azure exists] *\nchanged: [ocpcluster-node-0]\n\nTASK [populate /etc/azure/azure.conf] **\nchanged: [ocpcluster-node-0]\n\nTASK [insert the azure disk config into the node] **\nchanged: [ocpcluster-node-0] => (item={u'key': u'kubeletArguments.cloud-config', u'value': [u'/etc/azure/azure.conf']})\nchanged: [ocpcluster-node-0] => (item={u'key': u'kubeletArguments.cloud-provider', u'value': [u'azure']})\n\nRUNNING HANDLER [restart atomic-openshift-node] ****\nchanged: [ocpcluster-node-0]\n\nPLAY RECAP ****\nocpcluster-infra-0 : ok=4 changed=4 unreachable=0 failed=0 \nocpcluster-infra-1 : ok=4 changed=4 unreachable=0 failed=0 \nocpcluster-node-0 : ok=4 changed=4 unreachable=0 failed=0 \n\nThu Mar 22 00:28:23 UTC 2018 - Cloud Provider setup of node config on App Nodes completed successfully\nThu Mar 22 00:28:23 UTC 2018 - Sleep for 120\n\nPLAY [masters] \n\nTASK [set masters as unschedulable] ****\nchanged: [ocpcluster-master-0]\nchanged: [ocpcluster-master-2]\nchanged: [ocpcluster-master-1]\n\nPLAY RECAP *****\nocpcluster-master-0 : ok=1 changed=1 unreachable=0 failed=0 \nocpcluster-master-1 : ok=1 changed=1 unreachable=0 failed=0 \nocpcluster-master-2 : ok=1 changed=1 unreachable=0 failed=0 \n\nThu Mar 22 00:30:25 UTC 2018 - Cloud Provider setup of OpenShift Cluster 
completed successfully\nThu Mar 22 00:30:25 UTC 2018 - Rebooting cluster to complete installation\n\n[stderr]\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 6 100 6 0 0 590 0 --:--:-- --:--:-- --:--:-- 666\n[DEPRECATION WARNING]: 'include' for playbook includes. You should use \n'import_playbook' instead. This feature will be removed in version 2.8. \nDeprecation warnings can be disabled by setting deprecation_warnings=False in \nansible.cfg.\n[DEPRECATION WARNING]: 'include' for playbook includes. You should use \n'import_playbook' instead. This feature will be removed in version 2.8. \nDeprecation warnings can be disabled by setting deprecation_warnings=False in \nansible.cfg.\n[DEPRECATION WARNING]: The use of 'static' for 'include_role' has been \ndeprecated. Use 'import_role' for static inclusion, or 'include_role' for \ndynamic inclusion. This feature will be removed in a future release. \nDeprecation warnings can be disabled by setting deprecation_warnings=False in \nansible.cfg.\n[DEPRECATION WARNING]: The use of 'include' for tasks has been deprecated. Use \n'import_tasks' for static inclusions or 'include_tasks' for dynamic inclusions.\n This feature will be removed in a future release. Deprecation warnings can be \ndisabled by setting deprecation_warnings=False in ansible.cfg.\n[DEPRECATION WARNING]: include is kept for backwards compatibility but usage is\n discouraged. The module documentation details page may explain more about this\n rationale.. This feature will be removed in a future release. Deprecation \nwarnings can be disabled by setting deprecation_warnings=False in ansible.cfg.\n[DEPRECATION WARNING]: The use of 'static' has been deprecated. Use \n'import_tasks' for static inclusion, or 'include_tasks' for dynamic inclusion. \nThis feature will be removed in a future release. 
Deprecation warnings can be \ndisabled by setting deprecation_warnings=False in ansible.cfg.\n [WARNING]: Could not match supplied host pattern, ignoring: oo_all_hosts\n [WARNING]: Could not match supplied host pattern, ignoring: oo_lb_to_config\n [WARNING]: Could not match supplied host pattern, ignoring: oo_nfs_to_config\n [WARNING]: Consider using yum, dnf or zypper module rather than running rpm\n [WARNING]: Consider using unarchive module rather than running tar\n [WARNING]: Consider using get_url or uri module rather than running curl\n [WARNING]: Could not match supplied host pattern, ignoring:\noo_containerized_master_nodes\n [WARNING]: Could not match supplied host pattern, ignoring:\noo_nodes_use_flannel\n [WARNING]: Could not match supplied host pattern, ignoring:\noo_nodes_use_calico\n [WARNING]: Could not match supplied host pattern, ignoring:\noo_nodes_use_contiv\n [WARNING]: Could not match supplied host pattern, ignoring: oo_nodes_use_kuryr\n [WARNING]: Could not match supplied host pattern, ignoring: oo_nodes_use_nuage\n [WARNING]: Could not match supplied host pattern, ignoring: glusterfs\n [WARNING]: Could not match supplied host pattern, ignoring: glusterfs_registry\n [WARNING]: Module did not set no_log for stats_password\n [WARNING]: Module did not set no_log for external_host_password\nWarning: Permanently added 'ocpcluster-master-0,10.1.0.9' (ECDSA) to the list of known hosts.\r\nerror: 'openshift-infra' already has a value (apiserver), and --overwrite is false\n\"."\r\n }\r\n ]\r\n }\r\n}" } ] } ] } }' At line:1 char:1

New-AzureRmResourceGroupDeployment -Name OCPDeploy -ResourceGroupName ...

New-AzureRmResourceGroupDeployment : 5:30:59 PM - At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details. At line:1 char:1

New-AzureRmResourceGroupDeployment -Name OCPDeploy -ResourceGroupName ...

New-AzureRmResourceGroupDeployment : 5:30:59 PM - Template output evaluation skipped: at least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details. At line:1 char:1

New-AzureRmResourceGroupDeployment -Name OCPDeploy -ResourceGroupName ...


Details:
Deployment mode: PowerShell
Command: PS C:\WINDOWS\system32> New-AzureRmResourceGroupDeployment -Name OCPDeploy -ResourceGroupName OCPRG -TemplateFile C:\Users\mandava\openshift-container-platform-master\azuredeploy.json -TemplateParameterFile C:\Users\mandava\openshift-container-platform-master\azuredeploy.parameters.json
Docker version: 1.12.6
OpenShift version: 3.7
Instructions followed from: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/openshift-prerequisites

I looked into the logs in folders "0" and "1" on the bastion node and did not find any details beyond what's in the failure message. I did not find anything wrong in the deployOpenShift.sh script, and I do not see an "--overwrite" parameter there that I could set to "true". Attaching the logs from the "0" and "1" folders for reference, along with the scripts I'm using. Please suggest a solution for this failure. Thank you!

mavis1827 commented 6 years ago

@miosman Can you please help me with this issue?

miosman commented 6 years ago

What I can see from the logs is that the failing step is:

error: 'openshift-infra' already has a value (apiserver), and --overwrite is false

which corresponds to the command runuser -l $SUDOUSER -c "oc label nodes $MASTER-0 openshift-infra=apiserver" in deployOpenShift.sh.

It seems you were retrying the deployment after it failed at a later step, which is why the label is already applied. Try removing all the resources and recreating the environment with the fixes applied.
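For anyone who wants the labelling step to survive a re-run instead of rebuilding from scratch, here is a minimal sketch, assuming the only change needed is the standard --overwrite flag of oc label. The names label_node and OC_CMD are hypothetical and are not part of deployOpenShift.sh; the node name is the one that appears in the log above.

```shell
#!/usr/bin/env bash
# Sketch only: label_node and OC_CMD are hypothetical names used for
# illustration; they do not exist in deployOpenShift.sh.

# Apply a label to a node in a way that is safe to repeat.
# --overwrite tells `oc label` to replace an existing value instead of
# failing with "already has a value (...), and --overwrite is false".
label_node() {
  local node="$1" label="$2"
  # OC_CMD defaults to the real client; set OC_CMD=echo for a dry run.
  ${OC_CMD:-oc} label nodes "$node" "$label" --overwrite
}

# Example call, mirroring the step from the log:
#   label_node ocpcluster-master-0 openshift-infra=apiserver
```

Setting OC_CMD=echo prints the command instead of running it, which is a cheap way to check what would be executed before touching the cluster. In the actual script the call would still need to be wrapped in runuser as in the original command.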

mavis1827 commented 6 years ago

@miosman OK. Every attempt I make is from scratch: after every failure I delete the RG and start all over again.

However, I have made several changes to my scripts (changed the Docker version from 1.12.6 to 1.13.1, granted permission to the SP at the subscription level, and changed the defaultValue URL path), and the deployment is successful now. So I think we can close this issue.