ucrcsedept / galah

An automated grading system geared towards processing computer programming assignments.
Apache License 2.0

Sheep is deadlocking. #334

Open itsjohncs opened 11 years ago

itsjohncs commented 11 years ago

The sheep went mute and would not respond at all to SIGTERM. Died after SIGKILL.

Here are a couple of log lines from before and after the restart.

May 24 13:23:45 legion galah.sheep.consumer-18[XXXX]: Test request received, running tests.
May 24 13:23:45 legion galah.sheep.consumer-26[XXXX]: Test request received, running tests.
May 24 15:41:07 legion galah.sheep.consumer-0[ZZZZ]: Consumer starting.
May 24 15:41:07 legion galah.sheep.consumer-0[ZZZZ]: Waiting for virtual machine to become available...

Not sure what caused this yet.

Before it went completely mute, a rerun_test_harness command went through, which is worth investigating.

Unfortunately, I didn't find anything else in the logs that was particularly useful. I didn't find anything to definitively disprove my idea that the sheep were gradually getting deadlocked one by one until finally they all went quiet.

Something interesting is that when I finally killed the sheep, the shepherd reported losing track of 14 sheep despite the fact that there are supposed to be 35 of them. So sheep are definitely getting deadlocked gradually (not killed, otherwise the maintainer would start them back up).

My best guess is that this is related to the OOM killer getting upset with student submissions that take up too much memory. It could also be a faulty timeout mechanism in one of the functions where we poll and wait.
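To make the timeout suspicion concrete, here is a minimal sketch of a poll-and-wait loop that cannot block forever. It is not taken from the galah code base: `poll_until` and its parameters are hypothetical names, and it assumes Python 3 (`time.monotonic` and the built-in `TimeoutError`). The point is only that the deadline is computed once from a monotonic clock and checked on every iteration, so neither a wall-clock adjustment nor a condition that never comes true can hang the caller.

```python
import time


def poll_until(check, timeout, interval=0.5):
    """Call check() until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration only. Returns the truthy value from
    check(), or raises TimeoutError once the deadline has passed.
    """
    # A monotonic clock never jumps backwards, unlike time.time().
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                "condition not met within %.1f seconds" % timeout
            )
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A caller would wrap whatever it is waiting on, for example `poll_until(lambda: vm_is_ready(machine_id), timeout=30)`, where `vm_is_ready` is again a made-up name. Note that this only bounds the wait between calls; if `check()` itself blocks indefinitely, the loop still hangs, so the underlying call needs its own timeout as well.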

fdavis commented 11 years ago

What is going on here?

May 24 15:41:45 HOSTNAME galah.sheep.universal[98763]: consumer-21's thread aborted with an exception.
Traceback (most recent call last):
  File "/opt/galah/galah/galah/sheep/utility/universal.py", line 66, in newFunc
    zfunc(*zargs, **zkwargs)
  File "/opt/galah/galah/galah/sheep/components/consumer.py", line 37, in run
    _run()
  File "/opt/galah/galah/galah/sheep/components/consumer.py", line 127, in _run
    result = consumer.run_test(machine_id, message.body)
  File "/opt/galah/galah/galah/sheep/virtualsuites/vz/vz.py", line 259, in run_test
    raise RuntimeError("Could not connect to bootstrapper.")
RuntimeError: Could not connect to bootstrapper.

itsjohncs commented 11 years ago

Very good catch. I don't know of too many things that could cause that...