Closed: soccerGB closed this issue 6 years ago
From: Ionut Balutoiu (Cloudbase Solutions SRL)
It’s the exact issue we discussed in the past: https://github.com/dcos/dcos-windows/issues/49
We agreed that the cleanest solution is a patch to Docker so that the Docker API is not opened until dockerd itself has finished its initialization. However, this is interesting: I had added a retry, thinking that a subsequent request would go through and create the network: https://github.com/dcos/dcos-windows/blob/master/scripts/DCOSWindowsAgentSetup.ps1#L207-L212
Judging by http://dcos-win.westus.cloudapp.azure.com/dcos-testing/546/windows_agents/10.0.0.5/AzureData/DCOSWindowsAgentSetup.log, it seems it failed at every retry. We retry 10 times, with a 3-second delay between retries.
This is interesting, since you could successfully create the network after logging into the node. I'm wondering why the script couldn't create the network on any retry, while the manual creation succeeded.
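For reference, a minimal sketch of the kind of retry described above; the function name, network name, and subnet are illustrative, not the exact code in DCOSWindowsAgentSetup.ps1:

```powershell
# Illustrative retry around creating the custom Docker NAT network.
# Retry count and delay match the values mentioned above (10 retries, 3 seconds apart).
function New-DockerNATNetworkWithRetry {
    Param(
        [string]$NetworkName = "customnat",   # hypothetical network name
        [string]$Subnet = "192.168.31.0/24",  # hypothetical subnet
        [int]$MaxRetries = 10,
        [int]$RetryDelaySeconds = 3
    )
    for ($i = 1; $i -le $MaxRetries; $i++) {
        docker.exe network create --driver="nat" --subnet="$Subnet" $NetworkName
        if ($LASTEXITCODE -eq 0) {
            Write-Output "Created Docker NAT network '$NetworkName' on attempt $i"
            return
        }
        Write-Output "Attempt ${i}: docker network create failed, retrying in $RetryDelaySeconds seconds"
        Start-Sleep -Seconds $RetryDelaySeconds
    }
    Throw "Failed to create Docker NAT network '$NetworkName' after $MaxRetries attempts"
}
```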
From: Li Li
The Windows agent setup currently does the following:
// dockerd is started by default by the system
// stop dockerd
Disable-DockerDefaultNATNetwork
// start dockerd -> takes ~2 mins
// stop dockerd
Update-Docker
// start dockerd -> takes ~2 mins
// create a custom NAT network -> fails due to the race condition where dockerd has not been started yet
New-DockerNATNetwork
Once dockerd is started, stopping it is asynchronous, so the next dockerd start will hit the error below until the old dockerd instance has fully stopped. Unfortunately, it sometimes takes Docker on Windows about 2 minutes to stop the existing dockerd instance.

Error starting daemon: pid file found, ensure docker is not running or delete C:\ProgramData\docker\docker.pid
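As an illustration (not part of the current script), one mitigation for this specific race is to wait until the old dockerd instance is fully gone before starting it again. A minimal sketch, assuming the Docker Windows service is named "docker" and using the pid file path from the error above; the function name and timeout are illustrative:

```powershell
# Illustrative helper: block until the previous dockerd instance has fully stopped,
# i.e. the service reports Stopped and the stale pid file is gone.
function Wait-DockerdStopped {
    Param(
        [int]$TimeoutSeconds = 180,  # dockerd can take ~2 minutes to stop
        [string]$PidFile = "C:\ProgramData\docker\docker.pid"
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        $service = Get-Service -Name "docker" -ErrorAction SilentlyContinue
        $serviceStopped = (-not $service) -or ($service.Status -eq "Stopped")
        if ($serviceStopped -and -not (Test-Path $PidFile)) {
            return
        }
        Start-Sleep -Seconds 5
    }
    Throw "dockerd did not stop within $TimeoutSeconds seconds"
}

# Usage: call this between stopping the docker service and the next start,
# so the restart no longer races against the asynchronous shutdown.
```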
Since we have multiple dockerd start/stop cycles in our current Windows agent deployment script, we hit this race condition from time to time. Some improvements that we can add:
This is a duplicate of https://github.com/dcos/dcos-windows/issues/49
Symptom (reported by Li)
leader.mesos A 60 Answer 192.168.255.6
Successfully resolved leader.mesos from DC/OS Windows slave 10.1.0.5
Trying to resolve master.mesos on Windows agent: 10.1.0.5
Name         Type TTL Section IPAddress
master.mesos A    60  Answer  192.168.255.5
master.mesos A    60  Answer  192.168.255.7
master.mesos A    60  Answer  192.168.255.6
Successfully resolved master.mesos from DC/OS Windows slave 10.1.0.5
Deploying a Windows Marathon application on DC/OS
Created deployment 5be11504-400f-4be1-830e-fe7443e82305
Trying to find 1 running tasks within a timeout of 1800 seconds.
Traceback (most recent call last):
  File "/home/jenkins/workspace/workspace/dcos-testing@2/mesos-jenkins/DCOS/utils/check-marathon-app-health.py", line 114, in <module>
    main()
  File "/home/jenkins/workspace/workspace/dcos-testing@2/mesos-jenkins/DCOS/utils/check-marathon-app-health.py", line 83, in main
    running_tasks = get_running_tasks(client, app["id"], app["instances"])
  File "/home/jenkins/workspace/workspace/dcos-testing@2/mesos-jenkins/DCOS/utils/check-marathon-app-health.py", line 40, in get_running_tasks
    timeout))
Exception: There weren't at least 1 running task spawned within a timeout of 1800 seconds
ERROR: Failed to get test-windows-app application health checks
Collecting logs from all the DC/OS nodes
Collecting Linux master logs
Collecting Linux agents logs
Collecting Windows agents logs
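The "Name Type TTL Section IPAddress" output above matches what Resolve-DnsName prints, so a quick manual re-check of the same records from the Windows agent could look like the sketch below; the record names are taken from the log, everything else is illustrative:

```powershell
# Manually verify that the DC/OS internal DNS names resolve on the Windows agent,
# mirroring the checks shown in the log above.
foreach ($name in @("leader.mesos", "master.mesos")) {
    try {
        Resolve-DnsName -Name $name -Type A -ErrorAction Stop | Format-Table -AutoSize
        Write-Output "Successfully resolved $name"
    } catch {
        Write-Output "Failed to resolve ${name}: $_"
    }
}
```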
Initial investigation from Li:
Possible cause: