soccerGB closed 6 years ago
resolved with workaround
It looks like a recent job run (https://mesos-jenkins.westus.cloudapp.azure.com/job/dcos-testing/546/console) hit this issue. It is transient, just as in the past.
Re-opened this issue for further tracking.
From: Li Li. The Windows agent setup currently does the following:
// dockerd is started by default by the system
// stop dockerd
Disable-DockerDefaultNATNetwork
// start dockerd (takes ~2 mins)
// stop dockerd
Update-Docker
// start dockerd (takes ~2 mins)
// create a custom NAT; fails on the race condition where dockerd has not been started yet
New-DockerNATNetwork
Once dockerd has been started, stopping it is asynchronous, so the next dockerd start fails with the error below until the old dockerd instance has fully stopped. Unfortunately, Windows Docker sometimes takes about 2 minutes to stop the existing dockerd instance.
Error starting daemon: pid file found, ensure docker is not running or delete C:\ProgramData\docker\docker.pid
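One way to avoid the pid-file error is to block until the old instance has cleaned up before starting dockerd again. The following is a minimal sketch in Python, not the deployment script's actual code. It assumes dockerd removes its pid file when it exits cleanly; the helper name `wait_for_pid_file_removed` and the timeout values are illustrative.

```python
import os
import time


def wait_for_pid_file_removed(pid_file, timeout_s=180.0, poll_s=2.0):
    """Return True once pid_file no longer exists, False if timeout_s elapses first.

    Assumption: the daemon deletes its pid file on a clean shutdown.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if not os.path.exists(pid_file):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A caller could run `wait_for_pid_file_removed(r"C:\ProgramData\docker\docker.pid")` after each stop and only start dockerd again when it returns True.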
Because our current Windows agent deployment script starts and stops dockerd several times, we hit this race condition from time to time. Some improvements that we can add:
1. Combine the dockerd service pre-configuration steps (disabling the default NAT network, updating the Docker version, etc.) so that dockerd is restarted fewer times.
2. Figure out whether there is a way to have the docker service stopped by default.
3. Ping dockerd to make sure it is ready before creating a custom NAT.
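The "ping dockerd" idea can be sketched as a polling loop that runs a cheap probe until the daemon answers. This is an illustrative Python sketch, not the change that was made in the deployment script. The function names and timeouts are hypothetical, and it assumes that `docker version` exits with a non-zero code while the daemon is unreachable.

```python
import subprocess
import time


def docker_is_ready():
    """Probe the daemon once.

    Assumption: `docker version` returns a non-zero exit code when the
    client cannot reach dockerd.
    """
    try:
        result = subprocess.run(
            ["docker", "version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker CLI missing, or the probe itself hung
        return False
    return result.returncode == 0


def wait_until_ready(probe=docker_is_ready, timeout_s=300.0, poll_s=5.0):
    """Call probe() until it returns True; return False if timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The custom NAT creation step would then run only after `wait_until_ready()` returns True. Passing the probe as a parameter keeps the loop testable without a running daemon.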
This issue was addressed by the changes in https://github.com/dcos/dcos-windows/pull/57
Symptom:
Log information
Possible cause:
The error log shows that the IIS task cannot be launched properly on the Windows agent node. According to the DC/OS cluster log, DC/OS tries to launch the task from a private agent node. The DC/OS master UI explains why: the Windows public agent node does not show up in the UI. Logging into the Windows public agent node shows that the “nat” driver was not created properly there. In other words, the pre-provisioning of the Windows public agent node actually failed. Details are at http://dcos-win.westus.cloudapp.azure.com/dcos-testing/546/windows_agents/10.0.0.5/
From the Docker daemon log, http://dcos-win.westus.cloudapp.azure.com/dcos-testing/546/windows_agents/10.0.0.5/dockerd.log, I haven’t been able to figure out what’s wrong. Any ideas?
Status: