rhettg / Tron

Next generation batch process scheduling and management

Machine failure during monitor results in crash #75

Closed rhettg closed 12 years ago

rhettg commented 12 years ago

We lost all our slwdc hosts, and that caught a monitor in a bad state.

Unhandled Error
Traceback (most recent call last):
 File "/usr/lib/python2.5/site-packages/twisted/conch/ssh/transport.py", line 165, in connectionLost
   self.service.serviceStopped()
 File "/usr/lib/python2.5/site-packages/tron/ssh.py", line 78, in serviceStopped
   self.service_stop_defer.callback(self)
 File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 280, in callback
   self._startRunCallbacks(result)
 File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 354, in _startRunCallbacks
   self._runCallbacks()
--- <exception caught here> ---
 File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 371, in _runCallbacks
   self.result = callback(self.result, *args, **kw)
 File "/usr/lib/python2.5/site-packages/tron/node.py", line 201, in _service_stopped
   raise Error("Run %s in state %s when service stopped", run_id, run.state)
tron.node.Error: ('Run %s in state %s when service stopped', 'trigger_finger_slwdc.0.monitor', 5)

rhettg commented 12 years ago

I think this one is mostly harmless, but I now have code to handle this case. We already handle the case where the connection fails while we're still waiting to connect. This crash happens when we have connected, but the service shuts down before we can start our channel. There is a timer that should eventually mark the run as a failure. The fix will be to short-circuit that timer rather than crash like this.
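
For illustration, here is a rough sketch of what short-circuiting the timer could look like. It is not the actual Tron change: the `Run` and `Node` classes, the state names, and the `start_timeout` parameter are all hypothetical, and it uses the standard library's `threading.Timer` (Python 3) in place of Twisted's reactor so that it runs on its own. The idea is that when the service stops while a run is still starting, we cancel the pending timeout and mark the run failed right away instead of raising.

```python
import threading

# Hypothetical state names; Tron's real run states are not shown in this issue.
STARTING, RUNNING, FAILED = "starting", "running", "failed"


class Run(object):
    def __init__(self, run_id):
        self.id = run_id
        self.state = STARTING
        self.reason = None

    def fail(self, reason):
        # Idempotent, so the timer and the service-stopped path can both call it.
        if self.state != FAILED:
            self.state = FAILED
            self.reason = reason


class Node(object):
    def __init__(self, start_timeout=60.0):
        self.start_timeout = start_timeout
        self.timers = {}  # run id -> pending start-timeout timer

    def start(self, run):
        # Safety net: fail the run if its channel never comes up.
        timer = threading.Timer(
            self.start_timeout, run.fail, args=("start timed out",))
        timer.daemon = True
        self.timers[run.id] = timer
        timer.start()

    def service_stopped(self, run):
        # Short-circuit the timer instead of raising.
        timer = self.timers.pop(run.id, None)
        if timer is not None:
            timer.cancel()
        if run.state == STARTING:
            run.fail("service stopped before channel started")


node = Node()
run = Run("example_job.0.monitor")
node.start(run)
node.service_stopped(run)  # simulate losing the host mid-start
print(run.state, "-", run.reason)
```

Making `fail` idempotent means it does not matter whether the timeout or the service-stopped callback gets there first, which is the property we want when a whole set of hosts disappears at once.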