crash on recursion depth for locked resources

wcpettus commented 7 years ago

This is a case study in what happens when an identical service tries to start when it is already running somewhere else. It happened this weekend because the multido (status_multigets.yaml) service was running on my account to patch dead endpoints, and Mathieu tried to restart all the services to roll back releases without killing my instance.

Everything starts fine, but then service can't use the top-level name "status_multido" because it is already bound to a running service:

2016-11-19T22:30:47[INFO    ] dragonfly.subcommands.open_spimescape_portal(60) -> spimescapes created and populated
2016-11-19T22:30:47[INFO    ] dragonfly.subcommands.open_spimescape_portal(61) -> Configuration of status_multido complete, star            
ting consumption
2016-11-19T22:30:47[INFO    ] dripline.core.service(398) -> starting event loop for node status_multido-----------------------------
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(93) -> Connecting to myrna.p8
2016-11-19T22:30:47[INFO    ] dripline.core.service(409) -> calling setup methods
2016-11-19T22:30:47[INFO    ] dripline.core.service(423) -> startup calls complete
-----------------------------
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(106) -> Connection opened
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(115) -> Adding connection close callback
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(158) -> Creating a new channel
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(170) -> Channel opened
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(182) -> Adding channel close callback
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(209) -> Declaring exchange requests
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(209) -> Declaring exchange alerts
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(221) -> Exchange declared
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(232) -> Declaring queue status_multido
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(221) -> Exchange declared
2016-11-19T22:30:47[DEBUG   ] dripline.core.service(232) -> Declaring queue status_multido
2016-11-19T22:30:47[WARNING ] dripline.core.service(198) -> Channel 1 was closed: (405) RESOURCE_LOCKED - cannot obtain exclusiv            e access to locked queue 'status_multido' in vhost '/'
2016-11-19T22:30:47[WARNING ] dripline.core.service(133) -> Connection closed, reopening in 5 seconds: (0) Not specified

the logs then quickly get boring as they repeat every 5 sec:

2016-11-19T22:30:52[DEBUG   ] dripline.core.service(93) -> Connecting to myrna.p8
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(106) -> Connection opened
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(115) -> Adding connection close callback
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(158) -> Creating a new channel
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(170) -> Channel opened
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(182) -> Adding channel close callback
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(209) -> Declaring exchange requests
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(209) -> Declaring exchange alerts
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(221) -> Exchange declared
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(232) -> Declaring queue status_multido
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(221) -> Exchange declared
2016-11-19T22:30:52[DEBUG   ] dripline.core.service(232) -> Declaring queue status_multido
2016-11-19T22:30:52[WARNING ] dripline.core.service(198) -> Channel 1 was closed: (405) RESOURCE_LOCKED - cannot obtain exclusiv            e access to locked queue 'status_multido' in vhost '/'
2016-11-19T22:30:52[WARNING ] dripline.core.service(133) -> Connection closed, reopening in 5 seconds: (0) Not specified

and they continue doing this until stuff start failing:

2016-11-19T22:50:43[DEBUG   ] dripline.core.service(93) -> Connecting to myrna.p8
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(106) -> Connection opened
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 859, in emit
    msg = self.format(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 732, in format
    return fmt.format(record)
  File "/home/project8/python_environments/drag_v1.2.7_drip_wp2.0.0_go1.2.2_hard_v1.2.5/local/lib/python2.7/site-packages/colorlog/colorlog.py", line 128, in format
    message = super(ColoredFormatter, self).format(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 474, in format
    s = self._fmt % record.__dict__
  File "/home/project8/python_environments/drag_v1.2.7_drip_wp2.0.0_go1.2.2_hard_v1.2.5/local/lib/python2.7/site-packages/colorlog/colorlog.py", line 45, in __missing__
    "or color sequence".format(name))
KeyError: 'purple is not a valid record attribute or color sequence'
Logged from file service.py, line 115
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 859, in emit
    msg = self.format(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 732, in format
    return fmt.format(record)
  File "/home/project8/python_environments/drag_v1.2.7_drip_wp2.0.0_go1.2.2_hard_v1.2.5/local/lib/python2.7/site-packages/colorlog/colorlog.py", line 128, in format
    message = super(ColoredFormatter, self).format(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 474, in format
    s = self._fmt % record.__dict__
  File "/home/project8/python_environments/drag_v1.2.7_drip_wp2.0.0_go1.2.2_hard_v1.2.5/local/lib/python2.7/site-packages/colorlog/colorlog.py", line 45, in __missing__
    "or color sequence".format(name))
KeyError: 'purple is not a valid record attribute or color sequence'
Logged from file service.py, line 158
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(170) -> Channel opened
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(182) -> Adding channel close callback
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(209) -> Declaring exchange requests
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(209) -> Declaring exchange alerts
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(221) -> Exchange declared
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(232) -> Declaring queue status_multido
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(221) -> Exchange declared
2016-11-19T22:50:43[DEBUG   ] dripline.core.service(232) -> Declaring queue status_multido
2016-11-19T22:50:43[WARNING ] dripline.core.service(198) -> Channel 1 was closed: (405) RESOURCE_LOCKED - cannot obtain exclusiv            e access to locked queue 'status_multido' in vhost '/'
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 859, in emit
    msg = self.format(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 732, in format
    return fmt.format(record)
  File "/home/project8/python_environments/drag_v1.2.7_drip_wp2.0.0_go1.2.2_hard_v1.2.5/local/lib/python2.7/site-packages/colorlog/colorlog.py", line 128, in format
    message = super(ColoredFormatter, self).format(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 474, in format
    s = self._fmt % record.__dict__
  File "/home/project8/python_environments/drag_v1.2.7_drip_wp2.0.0_go1.2.2_hard_v1.2.5/local/lib/python2.7/site-packages/colorlog/colorlog.py", line 45, in __missing__
    "or color sequence".format(name))
RuntimeError: maximum recursion depth exceeded while calling a Python object
Logged from file service.py, line 133
2016-11-19T22:50:48[DEBUG   ] dripline.core.service(93) -> Connecting to myrna.p8
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 859, in emit
    msg = self.format(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 732, in format
    return fmt.format(record)
  File "/home/project8/python_environments/drag_v1.2.7_drip_wp2.0.0_go1.2.2_hard_v1.2.5/local/lib/python2.7/site-packages/colorlog/colorlog.py", line 117, in format
    record = ColoredRecord(record)
2016-11-19T22:50:48[CRITICAL] dripline.core.service(435) -> Service <status_multido> crashing with error message:
maximum recursion depth exceeded while calling a Python object
2016-11-19T22:50:48[ERROR   ] dripline.core.service(436) -> Traceback (most recent call last):

at this point we are so deep in a recursion loop it takes 2000 lines to unwrap it, and the service crashes:

RuntimeError: maximum recursion depth exceeded while calling a Python object

2016-11-19T22:50:48[DEBUG   ] dripline.core.service(404) -> loop ended
warning: slack is only ever critical, setting that
supposed to do an emit

This isn't an issue, per se, it's really a discussion starting point. Automatically trying to restart is a good thing, and eventually crashing is the right behavior. But maybe it shouldn't take 20 minutes and the 240th connection attempt to crash?

laroque commented 7 years ago

Maybe this is in that output but I'm just going to ask:

Are these supervisord restarts, or is there some retry loop in the startup of dragonfly?
- If the former, presumably we could just change that setting to try fewer times. I know supervisord has some capacity to review execution return values. We could potentially have this return a value that indicates that restarting should not happen, but have another value for a "normal" crash, which may result in an attempted restart. There do not currently exist conventions for this but it seems worthwhile.
- If the latter, why did we change this at all? It used to just crash immediately if this name conflict existed. I think that this is the desired behavior. Deciding to retry or not should be a supervisor thing

wcpettus commented 7 years ago

Latter, the last two lines of the first code block orient you to where in service.py the logger.warning's are flying:

2016-11-19T22:30:47[WARNING ] dripline.core.service(198) -> Channel 1 was closed: (405) RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'status_multido' in vhost '/'
2016-11-19T22:30:47[WARNING ] dripline.core.service(133) -> Connection closed, reopening in 5 seconds: (0) Not specified

So line 133 is in method on_connection_closed, which has this block:

        self._channel = None
        if self._closing:
            self._connection.ioloop.stop()
        else:
            logger.warning('Connection closed, reopening in 5 seconds: (%s) %s',
                           reply_code, reply_text)
            self._connection.add_timeout(5, self.reconnect)

which is why we get in this recursion loop where we keep spawning deeper levels of reconnect. I'm not sure when else this block may get called.

Supervisord defaults to startretries=3 if we choose that option.

guiguem commented 7 years ago

This issue was moved to project8/dripline-python#9

project8 / dripline

crash on recursion depth for locked resources #167