Closed ckyoog closed 5 months ago
Cool story, appreciate your dedication. Deadlocks are certainly my favorite of all bugs :-) I left a small request in the PR. Thanks!
FYI - https://github.com/rabbitmq/amqp091-go/pull/256 Thank you again for the investigation and PR. I would love to learn more about your use case; are you on CNCF Slack?
Yes I am.
Since the PR is merged, the issue is fixed for me, so I will close this issue.
It is easy to spot this bug if the following 2 criteria are fulfilled.
(Luckily, or unluckily, my testing environment meets both of them.)
Describe the bug
In the beginning, I noticed an issue with the dispatcher. Every time the RabbitMQ server goes down, stays down for a short time, and then comes back up, the dispatcher can't restart consuming. It can reconnect to the RabbitMQ server, but it just can't restart consuming. If you check on the RabbitMQ server side (e.g. the management UI), you will see there is no consumer for the queue.
This issue kept happening in my test env, which drove me to investigate it.
With many debugging logs added, I narrowed the suspicious code down to pkg/dispatcher/dispatcher.go. When the connection to the server is lost, the channel registered by `Connection.NotifyClose()` or `Channel.NotifyClose()` becomes readable; `finishConsuming()` gets called, and in turn `wg.Wait()` is called to wait for the dispatch goroutines to finish; but `dispatch()`, which runs in those goroutines, gets stuck and never returns. `dispatch()` turns out to be stuck at the call to `msg.Ack()`. Even though the connection has already been lost at that point (`watchRabbitMQConnections()` would create a new connection afterwards), according to the doc of `amqp091` the call to `msg.Ack()` is supposed to return an error indicating the connection is lost, rather than block there. (So what makes `msg.Ack()` block?)
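For context, here is a simplified paraphrase of the relevant structure in pkg/dispatcher/dispatcher.go (a sketch only, not the literal source; names like `queueName` and `parallelism` are illustrative):

```go
// Sketch of the dispatcher's consume path (simplified, not the actual source).
package dispatcher

import (
	"context"
	"sync"

	amqp "github.com/rabbitmq/amqp091-go"
)

// dispatch stands in for the real dispatch(): it acks each delivery after
// handing it to the event handler. msg.Ack() is the call observed to block.
func dispatch(msgs <-chan amqp.Delivery) {
	for msg := range msgs {
		// ... invoke the event handler ...
		_ = msg.Ack(false)
	}
}

func consumeFromQueue(ctx context.Context, conn *amqp.Connection, ch *amqp.Channel, queueName string, parallelism int) error {
	// Unbuffered channels registered for close notifications.
	connNotifyChannel := conn.NotifyClose(make(chan *amqp.Error))
	chNotifyChannel := ch.NotifyClose(make(chan *amqp.Error))

	msgs, err := ch.Consume(queueName, "", false, false, false, false, nil)
	if err != nil {
		return err
	}

	wg := &sync.WaitGroup{}
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			dispatch(msgs)
		}()
	}

	// finishConsuming waits for the dispatch goroutines to drain and return.
	finishConsuming := func() { wg.Wait() }

	select {
	case <-ctx.Done():
		finishConsuming()
	case <-connNotifyChannel: // connection lost
		finishConsuming()
	case <-chNotifyChannel: // never reached once the connection case has fired
		finishConsuming()
	}
	return nil
}
```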
Expected behavior
Every time the RabbitMQ server goes down, stays down for no matter how long, and then comes back up, the dispatcher can restart consuming.
To Reproduce
Shut down the RabbitMQ server while the dispatcher is consuming. When the `NotifyClose` channel becomes readable, the function `finishConsuming()` is called. Bring the RabbitMQ server back up: we will see no more error logs from the dispatcher about attempts to reconnect to the RabbitMQ server, which indicates that the dispatcher has established a new connection to the server. But on the RabbitMQ server side, we will see no consumer for the queue. If we re-trigger the event, it won't be consumed and won't be sent to the event handler.
Knative release version
From 1.10.0 to 1.13.0. (I didn't test older versions; I think they might behave the same.)
Additional context
In the end, I managed to figure out the root cause, which is a deadlock that blocks `msg.Ack()`. The deadlock occurs this way: the RabbitMQ server, the remote peer, shuts down, and the `amqp091` library reacts by sending the close error to every channel registered via `NotifyClose()`.

The first deadlock occurs here: sending the error to the `Channel.NotifyClose` channel blocks, because there is no receiver on the `Channel.NotifyClose` channel.

Why is there no receiver on the `Channel.NotifyClose` channel? Look again at the dispatcher code (the sketch above). The receiver of the `Channel.NotifyClose` channel is the `case <-chNotifyChannel:` branch. It will never be reached, because another receiver, `case <-connNotifyChannel:`, has already been executed, and as a result the process is either blocked in `finishConsuming()` or has returned from the current function. (BTW, this deadlock only blocks one of the sub-goroutines, the one calling the "shutdown" function. Because it doesn't block the main goroutine, it doesn't look all that harmful by itself.)
The final consequence of this part is that `Channel.shutdown()` couldn't return because of that deadlock, so the mutex it took with `ch.m.Lock()` could never be unlocked. This, obviously, is the second deadlock.
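For reference, this is my reading of the relevant `amqp091` behaviour, reduced to a toy model (not the library's actual source): `shutdown()` holds the channel mutex while delivering the close error, and `Ack()` needs that same mutex.

```go
// Toy model of the amqp091 behaviour as I understand it (not the library's code).
package amqpmodel

import "sync"

type Error struct{ Reason string }

type Channel struct {
	m      sync.Mutex    // stands in for amqp091's ch.m
	closes []chan *Error // channels registered via NotifyClose()
}

func (ch *Channel) NotifyClose(c chan *Error) chan *Error {
	ch.closes = append(ch.closes, c)
	return c
}

func (ch *Channel) shutdown(e *Error) {
	ch.m.Lock()
	defer ch.m.Unlock() // never runs if a send below blocks

	for _, c := range ch.closes {
		c <- e // blocks forever if the registered channel is unbuffered and unread
	}
}

func (ch *Channel) Ack(tag uint64, multiple bool) error {
	ch.m.Lock() // blocks while shutdown() is stuck above
	defer ch.m.Unlock()
	// ... send basic.ack to the broker ...
	return nil
}
```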
To sum up, the whole deadlock chain looks like this (a minimal standalone sketch follows the list):

1. The `amqp091` library sends the error to the unbuffered channels registered by `Connection.NotifyClose()` and `Channel.NotifyClose()`.
2. Once the dispatcher has received the error from the `Connection.NotifyClose` channel, it no longer reads the `Channel.NotifyClose` channel, which blocks the sender, `Channel.shutdown()`.
3. `Channel.shutdown()` being blocked makes the mutex taken by `ch.m.Lock()` stay locked.
4. `ch.m` staying locked blocks `msg.Ack()`.
5. `msg.Ack()` blocking the dispatch goroutine means `wg.Done()` will never be called, which leaves `finishConsuming()` blocked at the call to `wg.Wait()`.
6. `finishConsuming()`, in turn, blocks its caller, `ConsumeFromQueue()`, which eventually results in the dispatcher being unable to restart consuming.
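To make the chain concrete without RabbitMQ in the picture, here is a minimal standalone program that reproduces the same pattern with plain sync primitives (the names are only stand-ins for the real ones); running it trips Go's runtime deadlock detector:

```go
// Minimal stand-alone reproduction of the deadlock chain (toy code, not the
// real dispatcher or amqp091).
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

func main() {
	var chM sync.Mutex         // stands in for amqp091's ch.m
	notify := make(chan error) // unbuffered, stands in for Channel.NotifyClose
	var wg sync.WaitGroup

	// "Channel.shutdown()": holds ch.m and tries to deliver the close error.
	go func() {
		chM.Lock()
		defer chM.Unlock()                        // never reached
		notify <- errors.New("connection closed") // blocks: nobody ever receives
	}()

	// "dispatch goroutine": msg.Ack() needs the same mutex.
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(100 * time.Millisecond) // let "shutdown" take the mutex first
		chM.Lock()                         // blocks forever: the second deadlock
		chM.Unlock()
	}()

	// "finishConsuming()": waits for the dispatch goroutine via wg.Wait().
	wg.Wait()
	fmt.Println("never reached") // runtime reports: all goroutines are asleep - deadlock!
}
```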
By referring to the doc of `amqp091`, I believe the deadlock stems from an improper use of the `NotifyClose` channels; see "Best practises for Connection and Channel notifications" in the `amqp091` documentation. So I have come up with my own fix. A PR will be created soon.
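For anyone hitting this before the fix lands: the pattern suggested by that doc, and roughly the direction my fix takes (the actual PR may differ in detail), is to make sure the library's send can never block, e.g. by registering buffered channels and keeping every notification channel drained. A sketch:

```go
// Sketch of the recommended pattern only; the actual fix in the PR may differ.
package dispatcherfix

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// watchClose is a hypothetical helper: it registers buffered NotifyClose
// channels and keeps both of them drained, so Connection.shutdown() and
// Channel.shutdown() can always deliver their close error.
func watchClose(conn *amqp.Connection, ch *amqp.Channel, onClosed func()) {
	// Buffered channels: the library's send succeeds even if nobody happens
	// to be receiving at that instant.
	connClose := conn.NotifyClose(make(chan *amqp.Error, 1))
	chClose := ch.NotifyClose(make(chan *amqp.Error, 1))

	// Drain the channel-level notifications in their own goroutine so this
	// receiver is never abandoned.
	go func() {
		for err := range chClose {
			log.Printf("channel closed: %v", err)
		}
	}()

	if err, ok := <-connClose; ok {
		log.Printf("connection closed: %v", err)
	}
	onClosed() // e.g. tear down and let the reconnect logic take over
}
```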