Open tomlobato opened 1 year ago
Trying with the sneakers master branch (2.13.0.pre/d761dfe1), it worked! Is commit 2.13.0.pre/d761dfe1 safe to run in production? If not, would it be a viable path to backport the fix to 2.12.0? Below are the logs from a server restart through to full server and consumer recovery.
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: This is the message: {"message":"Hello, everybody!"}
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: This is the message: {"message":"Hello, everybody!"}
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: This is the message: {"message":"Hello, everybody!"}
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/concurrent/continuation_queue.rb:39:in `block in poll': Timeout::Error (Timeout::Error)
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/concurrent/continuation_queue.rb:30:in `synchronize'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/concurrent/continuation_queue.rb:30:in `poll'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/session.rb:1411:in `wait_on_continuations'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/session.rb:579:in `block in close_channel'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/session.rb:575:in `synchronize'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/session.rb:575:in `close_channel'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from /home/ruby/bundle/ruby/3.0.0/gems/bunny-2.20.2/lib/bunny/channel.rb:254:in `close'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: from script/rmq_subscribe.rb:22:in `<main>'
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.965858 #1] DEBUG -- : Sent protocol preamble
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.968932 #1] DEBUG -- : Sent connection.start-ok
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.969761 #1] DEBUG -- : Heartbeat interval negotiation: client = server, server = 60, result = 60
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: I, [2023-01-16T10:34:00.969798 #1] INFO -- : Heartbeat interval used (in seconds): 60
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.969817 #1] DEBUG -- : Will use socket read timeout of 132 seconds
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.969827 #1] DEBUG -- : Initializing channel ID allocator with channel_max = 2047
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.969929 #1] DEBUG -- : Sent connection.tune-ok with heartbeat interval = 60, frame_max = 131072, channel_max = 2047
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.970002 #1] DEBUG -- : Sent connection.open with vhost = /
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.970952 #1] DEBUG -- : Initializing heartbeat sender...
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.971302 #1] DEBUG -- : Allocated channel id: 1
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.972768 #1] DEBUG -- : Session#handle_frame on 1: #<AMQ::Protocol::Channel::OpenOk:0x00007f87043f6528 @channel_id="">
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.976283 #1] DEBUG -- : Session#handle_frame on 1: #<AMQ::Protocol::Queue::DeclareOk:0x00007f87043e4738 @queue="test1", @message_count=27, @consumer_count=0>
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.976347 #1] DEBUG -- : Channel#handle_frame on channel 1: #<AMQ::Protocol::Queue::DeclareOk:0x00007f87043e4738 @queue="test1", @message_count=27, @consumer_count=0>
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.983290 #1] DEBUG -- : Session#handle_frame on 1: #<AMQ::Protocol::Basic::ConsumeOk:0x00007f87043c9a50 @consumer_tag="bunny-1673876040000-599959624618">
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:34:00.983340 #1] DEBUG -- : Channel#handle_frame on channel 1: #<AMQ::Protocol::Basic::ConsumeOk:0x00007f87043c9a50 @consumer_tag="bunny-1673876040000-599959624618">
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:44.473088 #1] DEBUG -- : Session#handle_frame on 0: #<AMQ::Protocol::Connection::Close:0x00007f8704143c90 @reply_code=320, @reply_text="CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'", @class_id=0, @method_id=0>
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: W, [2023-01-16T10:37:44.473287 #1] WARN -- : Recovering from connection.close (CONNECTION_FORCED - broker forced connection closure with reason 'shutdown')
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: W, [2023-01-16T10:37:44.473780 #1] WARN -- : Will recover from a network failure (no retry limit)...
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: Done
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:49.474455 #1] DEBUG -- : Will attempt connection recovery...
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.474914 #1] DEBUG -- : Will attempt connection recovery...
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: W, [2023-01-16T10:37:54.475127 #1] WARN -- : Retrying connection on next host in line: rabbitmq-headless.servers:5672
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.478264 #1] DEBUG -- : Sent protocol preamble
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.479502 #1] DEBUG -- : Sent connection.start-ok
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.479567 #1] DEBUG -- : Heartbeat interval negotiation: client = server, server = 60, result = 60
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: I, [2023-01-16T10:37:54.479581 #1] INFO -- : Heartbeat interval used (in seconds): 60
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.479594 #1] DEBUG -- : Will use socket read timeout of 132 seconds
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.479796 #1] DEBUG -- : Sent connection.tune-ok with heartbeat interval = 60, frame_max = 131072, channel_max = 2047
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.480574 #1] DEBUG -- : Sent connection.open with vhost = /
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.480627 #1] DEBUG -- : Initializing heartbeat sender...
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.480890 #1] DEBUG -- : Connection is now open
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: D, [2023-01-16T10:37:54.480906 #1] DEBUG -- : Resetting recovery attempt counter after successful reconnection
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: This is the message: {"message":"Hello, everybody!"}
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: This is the message: {"message":"Hello, everybody!"}
leadfy-app/rmq-consumer-56748fdfb8-wpxd2[rmq-consumer]: This is the message: {"message":"Hello, everybody!"}
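As a side note on the log above: the "socket read timeout of 132 seconds" follows from the negotiated 60-second heartbeat. Bunny appears to apply a 2.2x multiplier to the heartbeat interval when sizing the read timeout; this is inferred from the logged values (60 s heartbeat, 132 s timeout) and should be verified against the Bunny source:

```ruby
# Derive the socket read timeout Bunny logs from the negotiated heartbeat.
# The 2.2 multiplier is inferred from the log output above
# (60 s heartbeat -> 132 s read timeout), not confirmed from the Bunny source.
HEARTBEAT_TIMEOUT_FACTOR = 2.2

def socket_read_timeout(heartbeat_seconds)
  (heartbeat_seconds * HEARTBEAT_TIMEOUT_FACTOR).round
end

puts socket_read_timeout(60) # => 132, matching the log line
```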
Hi @tomlobato!
I have the same problem. Did you try version 2.13.0.pre/d761dfe1 in production?
Hey guys, we are experiencing the same problem. Did you find the underlying cause of the issue? Does upgrading sneakers to 2.13.0.pre solve it?
@tomlobato Did you also update Bunny to version 2.22.0 when you upgraded sneakers to 2.13.0.pre and the issue was resolved? There were a couple of fixes in the Bunny gem that might be relevant:
@jondot @michaelklishin Is there an ETA for the release of 2.13.0.pre? Thanks
Hi guys, actually 2.13.0.pre didn't fully fix the issue. I still need to run more tests to find a set of versions and a configuration that works. For now, I've disabled automatic Kubernetes upgrades; after a manual upgrade, I restart the workers manually.
I am using sneakers 2.12.0 (Bunny 2.20.2) with a RabbitMQ 3.10.6 cluster in Kubernetes, connecting to the headless service.
When the RabbitMQ servers restart (via a k8s StatefulSet rollout or a k8s cluster version upgrade), the consumers recover the connection but do not recreate the previously existing channels, so they are unable to process messages.
When testing the same scenario directly with Bunny, without sneakers, the channels are recreated as expected. I have tested with a single-threaded worker and with multiple threads, and with both classic and quorum queues.
Below are the debug logs from the moment the server sends the shutdown to the sneakers consumer.
I would be very grateful for any pointers to fix, understand, or further debug this situation.
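For reference, a direct-Bunny consumer of roughly this shape, relying only on Bunny's built-in automatic recovery, recreates its channel after a broker restart. This is a minimal sketch: the AMQP URL, queue name, and helper name are assumptions, and the connection code only runs when `RUN_CONSUMER` is set so the file loads without a broker or the gem installed:

```ruby
# Hypothetical minimal direct-Bunny consumer for comparing recovery behavior
# against sneakers. URL and queue name are assumptions taken from the logs.
def run_consumer(amqp_url, queue_name)
  require "bunny" # deferred so this sketch parses without the gem installed

  conn = Bunny.new(amqp_url,
                   automatically_recover: true,   # Bunny's built-in network recovery
                   network_recovery_interval: 5)  # matches the 5 s retry cadence in the logs
  conn.start
  ch = conn.create_channel
  q  = ch.queue(queue_name)
  q.subscribe(block: true) do |_delivery_info, _properties, body|
    puts "This is the message: #{body}"
  end
ensure
  conn&.close
end

# Guarded so the sketch can be loaded without connecting to a broker.
if ENV["RUN_CONSUMER"]
  run_consumer("amqp://rabbitmq-headless.servers:5672", "test1")
end
```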