gmr / rabbitpy

A pure python, thread-safe, minimalistic and pythonic RabbitMQ client library
http://rabbitpy.readthedocs.org
BSD 3-Clause "New" or "Revised" License
243 stars 58 forks source link

Client heartbeat is throwing incorrect ConnectionResetException #58

Closed paradis closed 9 years ago

paradis commented 9 years ago

Hi, Thank you for all your work on rabbitpy. I think I have found a bug but I don't understand it fully.

First, it seems that a heartbeat is send by the server (v3.1.5) only if there is no exchanged message, so the current implementation will fail when a scenario like this occurs:

Moreover, in some case (but I don't manage to reproduce it everytime), when I launch a basic consumer, the ConnectionResetException is raised in the io-thread and not in the consumming thread leading the latter to hang forever (the corresponding io thread crashed).

Here is a stacktrace for the io thread in this case:

Exception in thread 0x7fb5a0015ad0-io:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/site-packages/rabbitpy/io.py", line 348, in run
    self._loop.run()
  File "/usr/lib/python2.6/site-packages/rabbitpy/io.py", line 190, in run
    self._poll()
  File "/usr/lib/python2.6/site-packages/rabbitpy/io.py", line 235, in _poll
    self._read()
  File "/usr/lib/python2.6/site-packages/rabbitpy/io.py", line 249, in _read
    self._data.read_callback(self._data.fd.recv(MAX_READ))
  File "/usr/lib/python2.6/site-packages/rabbitpy/io.py", line 386, in on_read
    self._channels[0][0].on_frame(value[1])
  File "/usr/lib/python2.6/site-packages/rabbitpy/channel0.py", line 110, in on_frame
    self._write_frame(heartbeat.Heartbeat())
  File "/usr/lib/python2.6/site-packages/rabbitpy/base.py", line 318, in _write_frame
    self._check_for_exceptions()
  File "/usr/lib/python2.6/site-packages/rabbitpy/base.py", line 218, in _check_for_exceptions
    raise exception
ConnectionResetException: No heartbeat in 1799.9976542 seconds

Here is the stacktrace for the running consuming thread:

Thread 4 (Thread 0x7fb5a4cde700 (LWP 56466)):
#2 <built-in function sleep>
#4 file '/usr/lib64/python2.6/threading.py', in 'wait'
#8 file '/usr/lib64/python2.6/Queue.py', in 'get'
#12 file '/usr/lib/python2.6/site-packages/rabbitpy/base.py', in '_read_from_queue'
#15 file '/usr/lib/python2.6/site-packages/rabbitpy/base.py', in '_wait_on_frame'
#19 file '/usr/lib/python2.6/site-packages/rabbitpy/channel.py', in '_consume_message'
#22 file '/usr/lib/python2.6/site-packages/rabbitpy/amqp_queue.py', in 'next_message'
#24 file '/usr/lib/python2.6/site-packages/rabbitpy/amqp_queue.py', in '__iter__'

I hope this will be enough for you to understand the problem, but if I can be of any help, I will be glad to help.

eandersson commented 9 years ago

I saw this as a possibility when reviewing the implementation but was never able to fully reproduce it. Mostly as my understanding of the heartbeat mechanic is somewhat limited.

gmr commented 9 years ago

Thanks for the info, hopefully I can find the case where it's not working as expected.

re "Some message are regularly exchanged and the server doesn't send heartbeat": You should always have a heartbeat frame sent by the server in the pre-negotiated interval.

Can you set logging to debug mode:

import logging

logging.basicConfig(level=logging.DEBUG)

and then connect to RabbitMQ with your problematic app and post the log lines from DEBUG:rabbitpy.channel0:Received frame: 'Connection.Start to DEBUG:rabbitpy.channel0:Connection opened?

Here's an example from my laptop:

DEBUG:rabbitpy.channel0:Received frame: 'Connection.Start'
DEBUG:rabbitpy.channel0:Server information: 'Licensed under the MPL.  See http://www.rabbitmq.com/'
DEBUG:rabbitpy.channel0:Server product: 'RabbitMQ'
DEBUG:rabbitpy.channel0:Server copyright: 'Copyright (C) 2007-2014 GoPivotal, Inc.'
DEBUG:rabbitpy.channel0:Server supports exchange_exchange_bindings: True
DEBUG:rabbitpy.channel0:Server supports connection.blocked: True
DEBUG:rabbitpy.channel0:Server supports authentication_failure_close: True
DEBUG:rabbitpy.channel0:Server supports basic.nack: True
DEBUG:rabbitpy.channel0:Server supports per_consumer_qos: True
DEBUG:rabbitpy.channel0:Server supports consumer_priorities: True
DEBUG:rabbitpy.channel0:Server supports consumer_cancel_notify: True
DEBUG:rabbitpy.channel0:Server supports publisher_confirms: True
DEBUG:rabbitpy.channel0:Server cluster_name: 'rabbit@gmr-mbp'
DEBUG:rabbitpy.channel0:Server platform: 'Erlang/OTP'
DEBUG:rabbitpy.channel0:Server version: '3.4.1'
DEBUG:rabbitpy.base:Writing frame: Connection.StartOk
DEBUG:rabbitpy.channel0:Received frame: 'Connection.Tune'
DEBUG:rabbitpy.channel0:Started a heartbeat timer that will fire in 1160 sec
DEBUG:rabbitpy.base:Writing frame: Connection.TuneOk
DEBUG:rabbitpy.base:Writing frame: Connection.Open
DEBUG:rabbitpy.channel0:Received frame: 'Connection.OpenOk'
DEBUG:rabbitpy.channel0:Connection opened

Also if you have an example that demonstrates the behavior that you can share, that would be helpful.

paradis commented 9 years ago

Here is a script and the logs to reproduce the first problem. https://gist.github.com/paradis/615e3d5607aef2f93597

# rabbitmqctl status
Status of node '<>' ...
[{running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.1.5"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.1.5"},
      {webmachine,"webmachine","1.10.3-rmq3.1.5-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.1.5-git680dba8"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.1.5"},
      {rabbit,"RabbitMQ","3.1.5"},
      {os_mon,"CPO  CXC 138 46","2.2.7"},
      {inets,"INETS  CXC 138 49","5.7.1"},
      {xmerl,"XML parser","1.2.10"},
      {mnesia,"MNESIA  CXC 138 12","4.5"},
      {amqp_client,"RabbitMQ AMQP Client","3.1.5"},
      {sasl,"SASL  CXC 138 11","2.1.10"},
      {stdlib,"ERTS  CXC 138 10","1.17.5"},
      {kernel,"ERTS  CXC 138 10","2.14.5"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
gmr commented 9 years ago

Good news? I was able to reliably replicate this in production over the weekend.

I'm impacted time wise early this week and need to work on replicating it with debug logging turned on to figure out exactly what's going on. I'll try and address it by the end of the week.

gmr commented 9 years ago

I've just pushed a change that should address this, updating the client side heartbeat checker to get reset on every frame received from the server. This should prevent this from happening in the future.