vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic
Other
639 stars 173 forks source link

11-01-Upgrade: Upgrade VCH with unreasonably short timeout and automatic rollback after failure fails sometimes with wrong error #8023

Open mhagen-vmware opened 6 years ago

mhagen-vmware commented 6 years ago

This appears to be a race in the timeout logic. We expect a timeout error but sometimes we get CDE.

time="2018-06-04T15:53:26Z" level=info msg="### Upgrading VCH ####" 
time="2018-06-04T15:53:26Z" level=debug msg="Accepting host \"10.158.214.108\" thumbprint A8:18:69:5D:C9:4E:5F:56:A1:44:E1:62:10:C1:E2:D3:52:E4:B9:C3" 
time="2018-06-04T15:53:26Z" level=debug msg="Creating VMOMI session with thumbprint A8:18:69:5D:C9:4E:5F:56:A1:44:E1:62:10:C1:E2:D3:52:E4:B9:C3" 
time="2018-06-04T15:53:26Z" level=debug msg="Session Environment Info: " API Type=VirtualCenter API Version=6.7 Build=8170161 Name="VMware vCenter Server" OS Type=linux-x64 Product ID=vpx UUID=3201876a-20e5-4c08-980d-f928009e1489 Vendor="VMware, Inc." Version=6.7.0 
time="2018-06-04T15:53:26Z" level=debug msg="vSphere resource cache populating..." 
time="2018-06-04T15:53:26Z" level=debug msg="Cached dc: " 
time="2018-06-04T15:53:26Z" level=debug msg="Cached cluster: /dc1/host/cls" 
time="2018-06-04T15:53:27Z" level=debug msg="Cached pool: " 
time="2018-06-04T15:53:27Z" level=debug msg="Cached folders: " 
time="2018-06-04T15:53:27Z" level=debug msg="Error count populating vSphere cache: (1)" 
time="2018-06-04T15:53:27Z" level=debug msg="new validator Session.Populate: Failure finding ds (): default datastore resolves to multiple instances, please specify" 
time="2018-06-04T15:53:27Z" level=info msg="Validating target" 
time="2018-06-04T15:53:27Z" level=debug msg="properties: [config.extraConfig summary.runtime.connectionState]" 
time="2018-06-04T15:53:27Z" level=debug msg="Cannot find image stores from configuration; attempting to find from vm datastore" 
time="2018-06-04T15:53:27Z" level=debug msg="properties: [datastore summary.runtime.connectionState]" 
time="2018-06-04T15:53:27Z" level=info 
time="2018-06-04T15:53:27Z" level=info msg="VCH ID: VirtualMachine:vm-7035" 
time="2018-06-04T15:53:27Z" level=debug msg="properties: [config.extraConfig summary.runtime.connectionState]" 
time="2018-06-04T15:53:28Z" level=debug msg="properties: [config.extraConfig summary.runtime.connectionState]" 
time="2018-06-04T15:53:28Z" level=debug msg="No need to migrate configuration for \"VirtualMachine:vm-7035\"" 
time="2018-06-04T15:53:28Z" level=debug msg="Cannot find image stores from configuration; attempting to find from vm datastore" 
time="2018-06-04T15:53:28Z" level=debug msg="properties: [datastore summary.runtime.connectionState]" 
time="2018-06-04T15:53:28Z" level=debug msg="properties: [config.files.vmPathName summary.runtime.connectionState]" 
time="2018-06-04T15:53:28Z" level=info msg="Creating directory [datastore1] VCH-19290-1068" 
time="2018-06-04T15:53:28Z" level=info msg="Datastore path is [datastore1] VCH-19290-1068" 
time="2018-06-04T15:53:28Z" level=debug msg="Datacenter: \"dc1\", Cluster: \"ClusterComputeResource:domain-c7\", Resource Pool: \"/dc1/host/cls/Resources\"" 
time="2018-06-04T15:53:28Z" level=debug msg="properties: [config.files.vmPathName summary.runtime.connectionState]" 
time="2018-06-04T15:53:28Z" level=info msg="Uploading ISO images" 
time="2018-06-04T15:53:28Z" level=debug msg="Checking if file already exists: VCH-19290-1068/V1.5.0-DEV-19290-02BA3AD-appliance.iso" 
time="2018-06-04T15:53:28Z" level=debug msg="File not found. Nothing to do." 
time="2018-06-04T15:53:28Z" level=info msg="Uploading appliance.iso as V1.5.0-DEV-19290-02BA3AD-appliance.iso" 
time="2018-06-04T15:53:33Z" level=debug msg="Checking if file already exists: VCH-19290-1068/V1.5.0-DEV-19290-02BA3AD-bootstrap.iso" 
time="2018-06-04T15:53:33Z" level=debug msg="File not found. Nothing to do." 
time="2018-06-04T15:53:33Z" level=info msg="Uploading bootstrap.iso as V1.5.0-DEV-19290-02BA3AD-bootstrap.iso" 
time="2018-06-04T15:53:36Z" level=debug msg="properties: [snapshot summary.runtime.connectionState]" 
time="2018-06-04T15:53:37Z" level=info msg="Switching appliance iso to [datastore1] VCH-19290-1068/V1.5.0-DEV-19290-02BA3AD-appliance.iso" 
time="2018-06-04T15:53:37Z" level=debug msg="generating new config secret key" 
time="2018-06-04T15:53:37Z" level=debug msg="properties: [config.extraConfig summary.runtime.connectionState]" 
time="2018-06-04T15:53:37Z" level=info msg="Setting VM configuration" 
time="2018-06-04T15:53:37Z" level=debug msg="properties: [parentVApp resourcePool summary.runtime.connectionState]" 
time="2018-06-04T15:53:37Z" level=debug msg="vm compute resource: " 
time="2018-06-04T15:53:37Z" level=debug msg="VirtualMachine:vm-7035 resides in a multi-host cluster: true" 
time="2018-06-04T15:53:37Z" level=debug msg="DRS enabled: true" 
time="2018-06-04T15:53:37Z" level=error msg="Failed to upgrade: The operation is not allowed in the current state." 
time="2018-06-04T15:53:37Z" level=info msg="Rolling back upgrade" 
time="2018-06-04T15:53:37Z" level=info msg="Reverting to snapshot reconfigure for 19290" 
time="2018-06-04T15:53:39Z" level=debug msg="vm compute resource: " 
time="2018-06-04T15:53:39Z" level=debug msg="VirtualMachine:vm-7035 resides in a multi-host cluster: true" 
time="2018-06-04T15:53:39Z" level=debug msg="DRS enabled: true" 
time="2018-06-04T15:53:39Z" level=info msg="Waiting for IP information" 
time="2018-06-04T15:53:40Z" level=error msg="vic/pkg/trace.(*Operation).Err: CheckServiceReady during rollback error: context deadline exceeded\nvic/lib/install/management.(*Dispatcher).ensureRollbackReady:384 CheckServiceReady during rollback\nvic/cmd/vic-machine/upgrade.(*Upgrade).Run:106 vic-machine-linux\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-06-04T15:53:40Z" level=error msg="vic/lib/install/management.(*Dispatcher).ensureApplianceInitializes: CheckServiceReady during rollback error: context deadline exceeded\nvic/lib/install/management.(*Dispatcher).ensureRollbackReady:384 CheckServiceReady during rollback\nvic/cmd/vic-machine/upgrade.(*Upgrade).Run:106 vic-machine-linux\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-06-04T15:53:40Z" level=debug msg="properties: [config.extraConfig summary.runtime.connectionState]" 
time="2018-06-04T15:53:40Z" level=error msg="vic/pkg/trace.(*Operation).Err: CheckServiceReady during rollback error: context deadline exceeded\nvic/lib/install/management.(*Dispatcher).ensureRollbackReady:384 CheckServiceReady during rollback\nvic/cmd/vic-machine/upgrade.(*Upgrade).Run:106 vic-machine-linux\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-06-04T15:53:40Z" level=error msg="Unable to get vm config: Post https://10.158.214.108/sdk: context deadline exceeded" 
time="2018-06-04T15:53:40Z" level=debug msg="  State of components:" 
time="2018-06-04T15:53:40Z" level=debug msg="    \"vicadmin\": \"waiting to launch\"" 
time="2018-06-04T15:53:40Z" level=debug msg="    \"docker-personality\": \"waiting to launch\"" 
time="2018-06-04T15:53:40Z" level=debug msg="    \"port-layer\": \"waiting to launch\"" 
time="2018-06-04T15:53:40Z" level=info msg="Obtained IP address for client interface: \"10.158.214.180\"" 
time="2018-06-04T15:53:40Z" level=info msg="Checking VCH connectivity with vSphere target" 
time="2018-06-04T15:53:40Z" level=error msg="vic/pkg/trace.(*Operation).Err: CheckServiceReady during rollback error: context deadline exceeded\nvic/lib/install/management.(*Dispatcher).ensureRollbackReady:384 CheckServiceReady during rollback\nvic/cmd/vic-machine/upgrade.(*Upgrade).Run:106 vic-machine-linux\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-06-04T15:53:40Z" level=error msg="vic/pkg/trace.(*Operation).Err: CheckServiceReady during rollback error: context deadline exceeded\nvic/lib/install/management.(*Dispatcher).ensureRollbackReady:384 CheckServiceReady during rollback\nvic/cmd/vic-machine/upgrade.(*Upgrade).Run:106 vic-machine-linux\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-06-04T15:53:40Z" level=warning msg="Could not run VCH vSphere API target check due to Failed to get appliance power state, service might not be available at this moment. but the VCH may still function normally" 
time="2018-06-04T15:53:40Z" level=debug msg="Docker API endpoint: https://10.158.214.180:2376/info" 
time="2018-06-04T15:53:40Z" level=error msg="vic/pkg/trace.(*Operation).Err: CheckServiceReady during rollback error: context deadline exceeded\nvic/lib/install/management.(*Dispatcher).ensureRollbackReady:384 CheckServiceReady during rollback\nvic/cmd/vic-machine/upgrade.(*Upgrade).Run:106 vic-machine-linux\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-06-04T15:53:40Z" level=error msg="vic/pkg/trace.(*Operation).Err: CheckServiceReady during rollback error: context deadline exceeded\nvic/lib/install/management.(*Dispatcher).ensureRollbackReady:384 CheckServiceReady during rollback\nvic/cmd/vic-machine/upgrade.(*Upgrade).Run:106 vic-machine-linux\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-06-04T15:53:40Z" level=debug msg="Unhandled net error type: context.deadlineExceededError{}" 
time="2018-06-04T15:53:40Z" level=info msg="Docker API endpoint check failed: context deadline exceeded" 
time="2018-06-04T15:53:40Z" level=info msg="\tAPI may be slow to start - try to connect to API after a few minutes" 
time="2018-06-04T15:53:40Z" level=info msg="Appliance is rolled back to previous version" 
time="2018-06-04T15:53:40Z" level=info msg="Deleting upgrade isos" ' does not contain 'Upgrading VCH exceeded time limit'
mhagen-vmware commented 6 years ago

The key error appears to be here: time="2018-06-04T15:53:37Z" level=error msg="Failed to upgrade: The operation is not allowed in the current state."

It is coming from here I believe: https://github.com/vmware/vic/blob/master/lib/install/management/configure.go#L344 -> https://github.com/vmware/vic/blob/master/pkg/vsphere/vm/vm.go#L897 -> https://github.com/vmware/vic/blob/master/pkg/vsphere/vm/vm.go#L959

Ultimately the error appears to be originating from somewhere within powerOnDRS, but with the current logging I can't really see where. Likely from the vm.Datacenter.PowerOnVM though.