vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

6.0 Nightly failure - 5-21-Datastore-Path #6887

Open andrewtchin opened 6 years ago

andrewtchin commented 6 years ago

Date: 2017 Nov 30
Engine build: 15140
vSphere version: 6.0
Test(s): 5-21-Datastore-Path
Attachment: 5-21-Datastore-Path.zip

CDE (context deadline exceeded) during create, after uploading ISOs:

Nov 30 2017 09:03:41.157-06:00 DEBUG [ END ]  [vic/lib/install/management.(*Dispatcher).applianceConfiguration:755] [221.600369ms] 
Nov 30 2017 09:03:41.157-06:00 INFO  op=4336.2: 
Nov 30 2017 09:03:41.157-06:00 ERROR op=4336.2: Failed to obtain IP address for client interface
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2: Use vic-machine inspect to see if VCH has received an IP address at a later time
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:   State of all interfaces:
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "bridge" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "public" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "client" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "management" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:   State of components:
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "vicadmin": "<nil>"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "docker-personality": "<nil>"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "port-layer": "<nil>"
Nov 30 2017 09:03:41.158-06:00 DEBUG [ END ]  [vic/lib/install/management.(*Dispatcher).ensureApplianceInitializes:1066] [3m0.421635018s] 
Nov 30 2017 09:03:41.158-06:00 DEBUG [BEGIN]  [vic/lib/install/management.(*Dispatcher).CollectDiagnosticLogs:224]
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.1: Collecting ha-host hostd.log
Nov 30 2017 09:03:41.281-06:00 DEBUG [ END ]  [vic/lib/install/management.(*Dispatcher).CollectDiagnosticLogs:224] [122.740226ms] 
Nov 30 2017 09:03:41.281-06:00 INFO  op=4336.1:     API may be slow to start - try to connect to API after a few minutes:
Nov 30 2017 09:03:41.281-06:00 INFO  op=4336.1:         Run vic-machine-linux inspect to find API connection command and run the command if ip address is ready
Nov 30 2017 09:03:41.281-06:00 INFO  op=4336.1:         If command succeeds, VCH is started. If command fails, VCH failed to install - see documentation for troubleshooting.
Nov 30 2017 09:03:41.281-06:00 ERROR op=4336.1: vic/cmd/vic-machine/create.(*Create).Run.func3: Create error: context deadline exceeded
vic/cmd/vic-machine/create.(*Create).Run:741 Create
vic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux
Nov 30 2017 09:03:41.331-06:00 ERROR op=4336.1: --------------------
Nov 30 2017 09:03:41.331-06:00 ERROR op=4336.1: vic-machine-linux create failed: Creating VCH exceeded time limit of 3m0s. Please increase the timeout using --timeout to accommodate for a busy vSphere target
hickeng commented 6 years ago

As reported in the vic-machine output, the endpointVM was waiting for an IP address:

Nov 30 2017 09:03:41.157-06:00 ERROR op=4336.2: Failed to obtain IP address for client interface
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2: Use vic-machine inspect to see if VCH has received an IP address at a later time
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:   State of all interfaces:
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "bridge" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "public" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "client" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "management" IP: "waiting for IP"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:   State of components:
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "vicadmin": "<nil>"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "docker-personality": "<nil>"
Nov 30 2017 09:03:41.158-06:00 INFO  op=4336.2:     "port-layer": "<nil>"

Without tether.debug for the endpointVM, no further diagnosis can be done. We can hypothesize that:

  1. there was an issue with the DHCP server, or
  2. there was an issue with the DPG that was used for the client/public/management interface (they're all sharing)

Given that the toolbox started after boot and no heartbeat timeouts are reported in the vmware.log, it's unlikely that the endpointVM was so resource-starved that it couldn't obtain or set an address.

mdubya66 commented 6 years ago

Marked as a ship stopper until triaged. @cgtexmex @hickeng feel free to move it out if needed after discussion with @mlh78750.

cgtexmex commented 6 years ago

As @hickeng stated there's no further diagnosis that can be done for this issue as we don't have any logs and the environment that experienced the failure was torn down. #6894 has been opened to improve our response to a failure by improving log collection and available diagnostic information.

cc @mhagen-vmware @mlh78750

cgtexmex commented 6 years ago

Since we have no logs and it hasn't recurred in the past 5 nightly executions, I'm closing this issue.

cgtexmex commented 6 years ago

Reopened as a placeholder / reminder that we need to document / design what logs, etc. need to be provided to troubleshoot this or similar issues in the future.

mlh78750 commented 6 years ago

@cgtexmex is https://github.com/vmware/vic/issues/6894 the issue that can act as the placeholder? or were you thinking we needed something else?

cgtexmex commented 6 years ago

@mlh78750 yeah, #6894 is the placeholder / epic -- leaving this open more to remind me that I have a specific task to complete: document what would have been helpful for a similar failure. That might result in additional issues being opened to help track / decompose the work.

andrewtchin commented 6 years ago

VIC version:

v1.4.0-dev-17605-6144b31

Deployment details:

Nightly 6.0 5-21-Datastore-Path Datastore - Space in Path with Scheme

Logs:

time="2018-03-19T03:41:12Z" level=info msg="Waiting for IP information" 
time="2018-03-19T03:44:12Z" level=error msg="vic/pkg/trace.(*Operation).Err: Create error: context deadline exceeded\nvic/cmd/vic-machine/create.(*Create).Run:744 Create\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-03-19T03:44:12Z" level=error msg="vic/lib/install/management.(*Dispatcher).ensureApplianceInitializes: Create error: context deadline exceeded\nvic/cmd/vic-machine/create.(*Create).Run:744 Create\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-03-19T03:44:12Z" level=debug msg="properties: [config.extraConfig summary.runtime.connectionState]" 
time="2018-03-19T03:44:12Z" level=error msg="vic/pkg/trace.(*Operation).Err: Create error: context deadline exceeded\nvic/cmd/vic-machine/create.(*Create).Run:744 Create\nvic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux" 
time="2018-03-19T03:44:12Z" level=error msg="Unable to get vm config: Post https://10.161.9.152/sdk: context deadline exceeded" 

https://storage.cloud.google.com/vic-ci-logs/vic_nightly_logs_2018-03-16-20-32-56.zip?authuser=1 (@hickeng edited for correct link)

hickeng commented 6 years ago

The most recent failure timed out waiting for IP because it took 205s to get a DHCP response. We eventually got the following error, then succeeded on the retry:

Mar 17 2018 03:13:41.917Z DEBUG retrying "DHCP request": errno=11, error=resource temporarily unavailable

While looking for other occurrences of "resource temporarily unavailable" I found I've already diagnosed this same log bundle in #7541. I'm setting the labels back to what @cgtexmex had when this was a placeholder for logging improvements for easier diagnosis of the original failures.