Open itsjohncs opened 11 years ago
What is going on here?
May 24 15:41:45 HOSTNAME galah.sheep.universal[98763]: consumer-21's thread aborted with an exception.
Traceback (most recent call last):
  File "/opt/galah/galah/galah/sheep/utility/universal.py", line 66, in newFunc
    zfunc(*zargs, **zkwargs)
  File "/opt/galah/galah/galah/sheep/components/consumer.py", line 37, in run
    _run()
  File "/opt/galah/galah/galah/sheep/components/consumer.py", line 127, in _run
    result = consumer.run_test(machine_id, message.body)
  File "/opt/galah/galah/galah/sheep/virtualsuites/vz/vz.py", line 259, in run_test
    raise RuntimeError("Could not connect to bootstrapper.")
RuntimeError: Could not connect to bootstrapper.
Very good catch. I don't know too many things that could cause that...
The sheep went mute and would not respond at all to SIGTERM. Died after SIGKILL.
Here are a couple of logs from before and after the restart.
Not sure what caused this yet.
Before it went completely mute, a rerun_test_harness command went through, which is worth investigating.
Unfortunately, I didn't find anything else in the logs that was particularly useful. I also didn't find anything to definitively disprove my idea that the sheep were gradually getting deadlocked one by one until they all went quiet.
Something interesting is that when I finally killed the sheep, the shepherd reported losing track of 14 sheep even though there are supposed to be 35 of them. So sheep are definitely getting deadlocked gradually (not killed, otherwise the maintainer would start them back up).
My best guess is this is related to the OOM killer getting upset with student submissions that take up too much memory. It could also be some faulty timeout mechanism in any of the functions where we poll and wait.
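To make the timeout theory concrete, here is a minimal sketch of a poll-and-wait loop with a hard deadline. It is not taken from Galah's code: `poll_until` and all of its parameters are hypothetical names, and it assumes Python 3 (for the built-in `TimeoutError` and `time.monotonic`). The point is only that a loop shaped like this cannot wait forever, whereas one that never checks a deadline, or measures it against a wall clock that can jump, can hang a consumer thread indefinitely.

```python
import time


def poll_until(check, timeout, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper, not part of Galah. Returns the truthy value from
    check(), or raises TimeoutError once the deadline has passed.
    """
    # A monotonic clock is unaffected by system clock adjustments,
    # so the deadline cannot move backwards or jump forwards.
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                "condition not met within %.1f seconds" % timeout)
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

A caller would wrap whatever it is waiting on, for example `poll_until(lambda: connection_ready(), timeout=30)`, where `connection_ready` is again a made-up stand-in. If any existing wait loop has no equivalent of the `remaining <= 0` check, that would be a candidate for the hang.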