Closed: soccerGB closed this issue 6 years ago
From: Ionut Balutoiu (Cloudbase Solutions SRL)
It’s the exact issue we discussed in the past: https://github.com/dcos/dcos-windows/issues/49
We agreed that the cleanest solution is a patch to Docker so that the Docker API is not opened until dockerd itself has finished its initialization. However, this is interesting: I had added a retry, thinking that a subsequent request would go through and create the network: https://github.com/dcos/dcos-windows/blob/master/scripts/DCOSWindowsAgentSetup.ps1#L207-L212
Judging by http://dcos-win.westus.cloudapp.azure.com/dcos-testing/546/windows_agents/10.0.0.5/AzureData/DCOSWindowsAgentSetup.log, it seems it failed at every retry. We retry 10 times, with a 3-second delay between retries.
This is interesting, since you could successfully create the network after logging into the node. I'm wondering why the script couldn't create the network on any retry, while the manual creation succeeded.
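For reference, a minimal sketch of the kind of retry described above; the function name, network name, and subnet are illustrative, not the exact code in DCOSWindowsAgentSetup.ps1:

```powershell
# Illustrative retry around creating the custom Docker NAT network.
# Retry count and delay match the values mentioned above (10 retries, 3 seconds apart).
function New-DockerNATNetworkWithRetry {
    Param(
        [string]$NetworkName = "customnat",   # hypothetical network name
        [string]$Subnet = "192.168.31.0/24",  # hypothetical subnet
        [int]$MaxRetries = 10,
        [int]$RetryDelaySeconds = 3
    )
    for ($i = 1; $i -le $MaxRetries; $i++) {
        docker.exe network create --driver="nat" --subnet="$Subnet" $NetworkName
        if ($LASTEXITCODE -eq 0) {
            Write-Output "Created Docker NAT network '$NetworkName' on attempt $i"
            return
        }
        Write-Output "Attempt ${i}: docker network create failed, retrying in $RetryDelaySeconds seconds"
        Start-Sleep -Seconds $RetryDelaySeconds
    }
    Throw "Failed to create Docker NAT network '$NetworkName' after $MaxRetries attempts"
}
```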
From: Li Li
The Windows agent setup currently does the following:
// dockerd is started by default by the system
// stop dockerd
Disable-DockerDefaultNATNetwork
// start dockerd -> takes ~2 mins
// stop dockerd
Update-Docker
// start dockerd -> takes ~2 mins
// create a custom NAT network -> fails due to the race condition where dockerd has not been started yet
New-DockerNATNetwork
Once dockerd is started, stopping it is asynchronous, so the next dockerd start will hit the error below until the old dockerd instance has fully stopped. Unfortunately, it sometimes takes Docker on Windows about 2 minutes to stop the existing dockerd instance.

Error starting daemon: pid file found, ensure docker is not running or delete C:\ProgramData\docker\docker.pid
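As an illustration (not part of the current script), one mitigation for this specific race is to wait until the old dockerd instance is fully gone before starting it again. A minimal sketch, assuming the Docker Windows service is named "docker" and using the pid file path from the error above; the function name and timeout are illustrative:

```powershell
# Illustrative helper: block until the previous dockerd instance has fully stopped,
# i.e. the service reports Stopped and the stale pid file is gone.
function Wait-DockerdStopped {
    Param(
        [int]$TimeoutSeconds = 180,  # dockerd can take ~2 minutes to stop
        [string]$PidFile = "C:\ProgramData\docker\docker.pid"
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        $service = Get-Service -Name "docker" -ErrorAction SilentlyContinue
        $serviceStopped = (-not $service) -or ($service.Status -eq "Stopped")
        if ($serviceStopped -and -not (Test-Path $PidFile)) {
            return
        }
        Start-Sleep -Seconds 5
    }
    Throw "dockerd did not stop within $TimeoutSeconds seconds"
}

# Usage: call this between stopping the docker service and the next start,
# so the restart no longer races against the asynchronous shutdown.
```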
Since we have multiple dockerd start/stop cycles in our current Windows agent deployment script, we hit this race condition from time to time. Some improvements that we can add:
This is a duplicate of https://github.com/dcos/dcos-windows/issues/49
Symptom (reported by Li)
leader.mesos A 60 Answer 192.168.255.6
Successfully resolved leader.mesos from DC/OS Windows slave 10.1.0.5
Trying to resolve master.mesos on Windows agent: 10.1.0.5
Name         Type TTL Section IPAddress
master.mesos A    60  Answer  192.168.255.5
master.mesos A    60  Answer  192.168.255.7
master.mesos A    60  Answer  192.168.255.6
Successfully resolved master.mesos from DC/OS Windows slave 10.1.0.5
Deploying a Windows Marathon application on DC/OS
Created deployment 5be11504-400f-4be1-830e-fe7443e82305
Trying to find 1 running tasks within a timeout of 1800 seconds.
Traceback (most recent call last):
  File "/home/jenkins/workspace/workspace/dcos-testing@2/mesos-jenkins/DCOS/utils/check-marathon-app-health.py", line 114, in <module>
    main()
  File "/home/jenkins/workspace/workspace/dcos-testing@2/mesos-jenkins/DCOS/utils/check-marathon-app-health.py", line 83, in main
    running_tasks = get_running_tasks(client, app["id"], app["instances"])
  File "/home/jenkins/workspace/workspace/dcos-testing@2/mesos-jenkins/DCOS/utils/check-marathon-app-health.py", line 40, in get_running_tasks
    timeout))
Exception: There weren't at least 1 running task spawned within a timeout of 1800 seconds
ERROR: Failed to get test-windows-app application health checks
Collecting logs from all the DC/OS nodes
Collecting Linux master logs
Collecting Linux agents logs
Collecting Windows agents logs
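The "Name Type TTL Section IPAddress" output above matches what Resolve-DnsName prints, so a quick manual re-check of the same records from the Windows agent could look like the sketch below; the record names are taken from the log, everything else is illustrative:

```powershell
# Manually verify that the DC/OS internal DNS names resolve on the Windows agent,
# mirroring the checks shown in the log above.
foreach ($name in @("leader.mesos", "master.mesos")) {
    try {
        Resolve-DnsName -Name $name -Type A -ErrorAction Stop | Format-Table -AutoSize
        Write-Output "Successfully resolved $name"
    } catch {
        Write-Output "Failed to resolve ${name}: $_"
    }
}
```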
Initial investigation from Li:
Possible cause: