rhettg / Tron

Next generation batch process scheduling and management

Failure during reconfig could cause inconsistent service state #66

Open · rhettg opened 12 years ago

rhettg commented 12 years ago

Got the following crash, after an earlier crash caused by a misconfiguration:

Unhandled Error
Traceback (most recent call last):
 File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 445, in startReactor
   self.config, oldstdout, oldstderr, self.profiler, reactor)
 File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 348, in runReactorWithLogging
   reactor.run()
 File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 1170, in run
   self.mainLoop()
 File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 1179, in mainLoop
   self.runUntilCurrent()
--- <exception caught here> ---
 File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 778, in runUntilCurrent
   call.func(*call.args, **call.kw)
 File "/usr/lib/python2.5/site-packages/tron/service.py", line 255, in _restart_after_failure
   self.start()
 File "/usr/lib/python2.5/site-packages/tron/service.py", line 272, in start
   self.build_instance()
 File "/usr/lib/python2.5/site-packages/tron/service.py", line 303, in build_instance
   node = self.node_pool.next_round_robin()
exceptions.AttributeError: 'NoneType' object has no attribute 'next_round_robin'

Also:

Unhandled Error
Traceback (most recent call last):
 File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 445, in startReactor
   self.config, oldstdout, oldstderr, self.profiler, reactor)
 File "/usr/lib/python2.5/site-packages/twisted/application/app.py", line 348, in runReactorWithLogging
   reactor.run()
 File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 1170, in run
   self.mainLoop()
 File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 1179, in mainLoop
   self.runUntilCurrent()
--- <exception caught here> ---
 File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 778, in runUntilCurrent
   call.func(*call.args, **call.kw)
 File "/usr/lib/python2.5/site-packages/tron/service.py", line 88, in _run_monitor
   self.machine.transition("monitor")
 File "/usr/lib/python2.5/site-packages/tron/utils/state.py", line 107, in transition
   self.notify()
 File "/usr/lib/python2.5/site-packages/tron/utils/state.py", line 155, in notify
   listener()
 File "/usr/lib/python2.5/site-packages/tron/service.py", line 336, in _instance_change
   self.machine.transition("down")
exceptions.AttributeError: 'NoneType' object has no attribute 'transition'

We should look into how a service could ever end up in that state. Ideally we'd never crash during configuration at all, but either way it would be good to have some more defensive measures in place. This was not, by the way, one of the jobs or services involved in the prior crash.
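
A minimal sketch of the kind of defensive measure this suggests, assuming a simplified class shape; only build_instance, next_round_robin, _instance_change, and the "down" transition come from the tracebacks above, everything else is illustrative:

    import logging

    log = logging.getLogger("tron.service")

    class Service(object):
        """Sketch only; the real tron.service.Service carries far more state."""

        def __init__(self, name, node_pool, machine):
            self.name = name
            self.node_pool = node_pool
            self.machine = machine

        def build_instance(self):
            # Guard for the first traceback: bail out with a logged error
            # instead of raising AttributeError when node_pool is None.
            if self.node_pool is None:
                log.error("Service %s has no node pool; not building an instance",
                          self.name)
                return None
            return self.node_pool.next_round_robin()

        def _instance_change(self):
            # Guard for the second traceback: a state listener can still fire
            # after a reconfig has torn the service down and cleared its machine.
            if self.machine is None:
                log.warning("Instance change on removed service %s; ignoring",
                            self.name)
                return
            self.machine.transition("down")

Failing closed like this would keep a half-reconfigured service visible in the logs instead of surfacing as an unhandled error in the reactor loop.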

rhettg commented 12 years ago

A related issue during another reconfig. Note that the service involved wasn't the one being reconfigured:

2011-06-10 11:07:42,132 tron.www INFO Handling reconfig request
2011-06-10 11:07:42,133 tron.mcp INFO Loading configuration from /nail/tron/tron_config.yaml
2011-06-10 11:07:44,654 tron.mcp ERROR Reconfiguration failed
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/tron/mcp.py", line 195, in live_reconfig
    self.load_config()
  File "/usr/lib/python2.5/site-packages/tron/mcp.py", line 208, in load_config
    configuration.apply(self)
  File "/usr/lib/python2.5/site-packages/tron/config.py", line 294, in apply
    self._apply_services(mcp)
  File "/usr/lib/python2.5/site-packages/tron/config.py", line 254, in _apply_services
    mcp.add_service(new_service)
  File "/usr/lib/python2.5/site-packages/tron/mcp.py", line 281, in add_service
    service.absorb_previous(prev_service)
  File "/usr/lib/python2.5/site-packages/tron/service.py", line 391, in absorb_previous
    optimal_instances_per_node = self.count / len(self.node_pool.nodes)
AttributeError: 'NoneType' object has no attribute 'nodes'
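
One cheap defensive measure for this particular crash, sketched below: validate the node pool before absorb_previous divides by its size, so a bad reconfig fails with a clear configuration error rather than an AttributeError. ConfigError and the helper name are hypothetical, not taken from the tron codebase:

    class ConfigError(Exception):
        """Hypothetical error type; tron may already have an equivalent."""

    def usable_node_pool(service):
        # Check the pool exists and is non-empty before anything divides
        # by len(pool.nodes).
        pool = service.node_pool
        if pool is None or not pool.nodes:
            raise ConfigError("service %s has no usable node pool" % service.name)
        return pool

    # The failing line in absorb_previous would then read something like:
    #   optimal_instances_per_node = self.count / len(usable_node_pool(self).nodes)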

rhettg commented 12 years ago

The side effects have been mitigated, but the underlying issue still exists in 0.2.5.

It seems somehow related to a node pool being shared by both jobs and services: when I removed that sharing from the configuration, the problem went away.

Basically, the problem happens at config time, when the node pool for the service somehow becomes None. There is no obvious way for that to happen, but it could be an interaction between YAML parsing and the fancy 'canonicalization' of tagged entities.
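
If the shared node pool really is the trigger, YAML aliasing would explain how a change made through one consumer shows up in another. A small runnable illustration (the config snippet is made up, not a real tron_config.yaml, but it mirrors the "one pool, two consumers" shape described above):

    import yaml

    # Illustrative config: one anchored node pool (&pool) aliased by both a
    # job and a service.
    CONFIG = """
    node_pools:
      main: &pool
        nodes: [batch1, batch2]
    jobs:
      - name: job1
        node: *pool
    services:
      - name: service1
        node: *pool
    """

    data = yaml.safe_load(CONFIG)

    # PyYAML resolves every alias to the *same* object, so all three
    # references share one dict.
    job_pool = data["jobs"][0]["node"]
    service_pool = data["services"][0]["node"]
    assert job_pool is service_pool is data["node_pools"]["main"]

    # A mutation through one reference is visible through all of them --
    # if canonicalization ever emptied the shared entity in place, every
    # consumer of the pool would see it.
    job_pool["nodes"] = None
    assert service_pool["nodes"] is None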