mesosphere-backup / dcos-windows

Microsoft Windows support to DCOS
Apache License 2.0

DC/OS deployment issue: There weren't at least 1 running task spawned within a timeout of 1800 seconds (Docker service was not available immediately after the Docker service was started) #49

Closed soccerGB closed 6 years ago

soccerGB commented 6 years ago

Symptom:

  Windows agent node was not set up properly.
  Agent node setup failure was detected in the scale-up scenario: the DC/OS agent count does not match the Azure VM count (transient error). Even though this was hit during the CI scale-up test, the same issue could happen during a regular deployment.

Log information:

/AzureData/DCOSWindowsAgentSetup.log
error during connect: Post http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.36/networks/create: open //./pipe/docker_engine: The system cannot find the file specified. In the default daemon configuration on Windows, the docker client must be run elevated to connect. This error may also indicate that the docker daemon is not running.

Possible cause:

A race condition caused by multiple Docker stop/start operations called in back-to-back sequence.

From the error log, the failure is that the IIS task cannot be launched properly on the Windows agent node. Looking at the DC/OS cluster log, DC/OS tries to launch the task from a private agent node. From the DC/OS master UI, this is because the Windows public agent node does not show up in the DC/OS UI. Logging into the Windows public agent node shows that the “nat” driver was not created properly on it. In other words, the pre-provisioning of the Windows public agent node actually fails. Details at http://dcos-win.westus.cloudapp.azure.com/dcos-testing/546/windows_agents/10.0.0.5/ From the docker daemon log, http://dcos-win.westus.cloudapp.azure.com/dcos-testing/546/windows_agents/10.0.0.5/dockerd.log, I haven’t been able to figure out what’s wrong. Any idea?
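For reference, a minimal check like the following (illustrative PowerShell, not from this thread; it assumes the Docker CLI is on PATH and the daemon is reachable) can confirm whether a "nat"-driver network actually exists on the agent:

  # Illustrative check: list Docker networks backed by the Windows "nat" driver.
  $natNetworks = docker network ls --filter "driver=nat" --format "{{.Name}}"
  if (-not $natNetworks) {
      Write-Output "No 'nat' network found - agent pre-provisioning likely failed"
  } else {
      Write-Output "Found NAT network(s): $natNetworks"
  }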

Status:

Open
soccerGB commented 6 years ago

Resolved with a workaround.

ionutbalutoiu commented 6 years ago

Looks like a recent job run (https://mesos-jenkins.westus.cloudapp.azure.com/job/dcos-testing/546/console) hit this issue. It is transient, just like in the past.

Re-opened this issue for further tracking.

soccerGB commented 6 years ago

From Li Li: The Windows agent setup currently does the following:

  // start dockerd by default by system
  // stop dockerd
  Disable-DockerDefaultNATNetwork
  // start dockerd -> takes ~2 mins
  // stop dockerd
  Update-Docker
  // start dockerd -> takes ~2 mins
  // create a custom NAT --- fails at the race condition where dockerd has not been started yet
  New-DockerNATNetwork

Once dockerd is started, stopping it is asynchronous, so the next dockerd start will hit the error below until the old dockerd instance has fully stopped. Unfortunately, it sometimes takes Windows Docker about 2 minutes to stop the existing dockerd instance.

  Error starting daemon: pid file found, ensure docker is not running or delete C:\ProgramData\docker\docker.pid
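One way to avoid the pid-file error is to wait for the old dockerd process to exit before starting the service again. A minimal sketch, assuming the Windows service is named "docker" and picking an arbitrary 5-minute timeout (neither detail is taken from the actual deployment script):

  Stop-Service docker
  # The stop request can return before dockerd has actually exited, so poll for
  # the process to disappear (it can take ~2 minutes, per the observation above).
  $deadline = [DateTime]::Now.AddMinutes(5)   # illustrative timeout
  while ((Get-Process -Name dockerd -ErrorAction SilentlyContinue) -and ([DateTime]::Now -lt $deadline)) {
      Start-Sleep -Seconds 5
  }
  Start-Service docker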

Since our current Windows agent deployment script starts and stops dockerd multiple times, we hit this race condition from time to time. Some improvements that we can add:

  Combine the dockerd pre-configuration steps (disabling the default NAT network, updating the Docker version, etc.) to reduce the number of restarts.
  Figure out if there is a way to keep the Docker service stopped by default.
  Ping dockerd to make sure it is ready before creating the custom NAT network (see the sketch below).
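For the last item, a minimal sketch of pinging the daemon before creating the custom NAT network; the retry loop and timeout are illustrative, not taken from the deployment script, and New-DockerNATNetwork is the repo's existing helper:

  function Wait-DockerDaemon {
      param([int]$TimeoutSeconds = 180)   # illustrative timeout
      $deadline = [DateTime]::Now.AddSeconds($TimeoutSeconds)
      while ([DateTime]::Now -lt $deadline) {
          docker version *> $null   # only succeeds once dockerd answers on the named pipe
          if ($LASTEXITCODE -eq 0) { return }
          Start-Sleep -Seconds 5
      }
      throw "dockerd did not become ready within $TimeoutSeconds seconds"
  }

  Wait-DockerDaemon
  New-DockerNATNetwork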

soccerGB commented 6 years ago

This issue was addressed by the changes in https://github.com/dcos/dcos-windows/pull/57