rhettg opened 12 years ago
A related issue came up during another reconfig. Note that the service itself wasn't being reconfigured:
```
2011-06-10 11:07:42,132 tron.www INFO Handling reconfig request
2011-06-10 11:07:42,133 tron.mcp INFO Loading configuration from /nail/tron/tron_config.yaml
2011-06-10 11:07:44,654 tron.mcp ERROR Reconfiguration failed
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/tron/mcp.py", line 195, in live_reconfig
    self.load_config()
  File "/usr/lib/python2.5/site-packages/tron/mcp.py", line 208, in load_config
    configuration.apply(self)
  File "/usr/lib/python2.5/site-packages/tron/config.py", line 294, in apply
    self._apply_services(mcp)
  File "/usr/lib/python2.5/site-packages/tron/config.py", line 254, in _apply_services
    mcp.add_service(new_service)
  File "/usr/lib/python2.5/site-packages/tron/mcp.py", line 281, in add_service
    service.absorb_previous(prev_service)
  File "/usr/lib/python2.5/site-packages/tron/service.py", line 391, in absorb_previous
    optimal_instances_per_node = self.count / len(self.node_pool.nodes)
AttributeError: 'NoneType' object has no attribute 'nodes'
```
The side effects have been mitigated, but the underlying issue still exists in 0.2.5.
It seems to be related to a node pool being shared by both jobs and services: when I removed that sharing from the configuration, the problem went away.
The problem actually happens at config time, when the service's node pool somehow becomes None. There is no obvious way for this to happen, but it could be an interaction between YAML parsing and the fancy 'canonicalization' of tagged entities.
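To make the failure mode concrete, here is a minimal sketch that reproduces the last frame of the traceback outside Tron. The `NodePool` and `Service` classes are hypothetical, heavily simplified stand-ins, not Tron's actual classes; only the `self.count / len(self.node_pool.nodes)` expression is taken from the traceback. It shows that once `node_pool` is `None`, the arithmetic in `absorb_previous` fails with exactly the reported `AttributeError`, so the real bug is upstream, wherever the pool reference gets lost.

```python
class NodePool:
    """Hypothetical stand-in for a node pool: just a list of node names."""

    def __init__(self, nodes):
        self.nodes = list(nodes)


class Service:
    """Hypothetical, heavily simplified service; not Tron's actual class."""

    def __init__(self, name, count, node_pool):
        self.name = name
        self.count = count
        self.node_pool = node_pool

    def optimal_instances_per_node(self):
        # Same expression as service.py line 391 in the traceback; `//` keeps
        # the integer-division behavior that `/` had on Python 2.5 ints.
        return self.count // len(self.node_pool.nodes)


healthy = Service("web", count=4, node_pool=NodePool(["batch1", "batch2"]))
print(healthy.optimal_instances_per_node())

# A service whose pool reference was lost during config loading.
broken = Service("web", count=4, node_pool=None)
try:
    broken.optimal_instances_per_node()
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'nodes'
```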
Got the following crash after an earlier crash caused by a misconfiguration:
Also
We should look into how it could ever get into that state. Ideally we'd never have crashes during configuration, but it would be better to have some more defensive measures in place. This was not, btw, one of the jobs or services involved in the prior crash.
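One possible defensive measure, sketched under assumptions: validate every service's node pool before touching the running configuration, and reject the whole reconfig with a clear error instead of crashing partway through `_apply_services`. `NodePool`, `ServiceConfig`, `ConfigError`, `validate_services`, and `apply_services` are hypothetical names for illustration, not part of Tron. The check also catches an empty pool, which would otherwise turn the same expression into a `ZeroDivisionError`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodePool:
    # Hypothetical: a pool is just a list of node names here.
    nodes: List[str] = field(default_factory=list)


@dataclass
class ServiceConfig:
    # Hypothetical, simplified service configuration.
    name: str
    count: int
    node_pool: Optional[NodePool]


class ConfigError(Exception):
    """Raised instead of crashing halfway through applying a config."""


def validate_services(services: Dict[str, ServiceConfig]) -> List[str]:
    """Return a list of problems; an empty list means the services look sane."""
    problems = []
    for name in sorted(services):
        pool = services[name].node_pool
        if pool is None:
            problems.append(f"service {name!r} has no node pool")
        elif not pool.nodes:
            # len(pool.nodes) == 0 would raise ZeroDivisionError later.
            problems.append(f"service {name!r} has an empty node pool")
    return problems


def apply_services(
    running: Dict[str, ServiceConfig], new: Dict[str, ServiceConfig]
) -> Dict[str, ServiceConfig]:
    """Return the services to run, validating everything before swapping."""
    problems = validate_services(new)
    if problems:
        # `running` is never mutated, so the old config keeps working.
        raise ConfigError("; ".join(problems))
    return dict(new)


running = {"web": ServiceConfig("web", 4, NodePool(["batch1", "batch2"]))}
bad = {"web": ServiceConfig("web", 4, None)}
try:
    running = apply_services(running, bad)
except ConfigError as exc:
    print("Reconfiguration rejected:", exc)
print(running["web"].node_pool.nodes)
```

Because validation runs before anything is swapped in, a rejected reconfig leaves the previously running services untouched.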