madecoste / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

Swarming bot must detect stray processes and reboot when detected. #120

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
We currently run tests on our infrastructure that have the following property:
- The test succeeds.
- The test leaves child processes behind.

This scenario causes the next run on this bot to fail to delete temporary 
directories. Each OS will likely need a slightly different heuristic to detect 
this. A raw list of running processes may not be sufficient because of 
OS-started services.

When this condition is detected, an error must be reported, then the bot should 
reboot.
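
A minimal sketch of one possible heuristic, not the actual Swarming implementation: snapshot the process table before the task, snapshot it again afterwards, and treat anything new that is not on an ignore list as a stray. The function names, the `ignored_names` parameter and the use of `ps` (POSIX only) are assumptions made for illustration. Windows would need a different way to take the snapshot.

```python
import subprocess


def snapshot_processes():
    """Returns {pid: command name} for all running processes.

    Hypothetical helper: relies on the POSIX `ps` utility, so it does not
    work on Windows.
    """
    out = subprocess.check_output(
        ['ps', '-A', '-o', 'pid=', '-o', 'comm='], text=True)
    procs = {}
    for line in out.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            procs[int(parts[0])] = parts[1]
    return procs


def find_stray_processes(before, after, ignored_names=frozenset()):
    """Returns {pid: name} for processes that appeared between two snapshots.

    A pid whose name changed is also reported, to cover pid reuse. Names in
    ignored_names (for example OS-started services) are skipped.
    """
    return {
        pid: name for pid, name in after.items()
        if before.get(pid) != name and name not in ignored_names
    }
```

The bot would call `snapshot_processes()` before and after each task, and if `find_stray_processes()` returns anything, report the error and then reboot. The contents of the ignore list are the OS-specific part of the heuristic.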

Original issue reported on code.google.com by maruel@chromium.org on 16 Jun 2014 at 7:23

GoogleCodeExporter commented 9 years ago
In theory af9d837484984559f58d4f774e345ef0093c955a helps but in practice it's 
not enough.

Example with:
- browser_tests: https://chromium-swarm-dev.appspot.com/user/task/1477d91b5cc1700
- interactive_ui_tests: https://chromium-swarm-dev.appspot.com/user/task/1477c5a33c77000

This implies that "swarm_cleanup.py" needs improvement too. I'll make it use 
the same code path. Zombies are surrounding us.

Original comment by maruel@chromium.org on 28 Jul 2014 at 5:06