streadway / amqp

Go client for AMQP 0.9.1
http://godoc.org/github.com/streadway/amqp
BSD 2-Clause "Simplified" License
4.88k stars 621 forks source link

fix #254 Channel.Publish blocks indefinitely #508

Open sljeff opened 3 years ago

sljeff commented 3 years ago

I encountered a bug similar to this comment: https://github.com/streadway/amqp/issues/254#issuecomment-316401642

However, in a weak network environment, I get a deadlock after N+1 messages have been sent where N is the size of the buffered channel.


How to Reproduce:

Then it will deadlock.


The sequence of events is:

  1. Receive an old message's ack -> One() -> resequence() -> confirm() send ack to channel
  2. Receive another old message's ack -> One() get lock -> resequence() -> confirm() send ack to channel -> block (channel buffer == 1) (and lock confirms.m is not released)
  3. I am sending a message using Publish -> Failed to get lock

This time it will keep blocking; simply because the buffer is 1 but two acks are received at the same time (although I sent messages serially, we can indeed reproduce this extreme case).


I don't think Publish needs to use the same lock as the other methods, it's just a self-incrementing. Even if the program is in resequence(), there may not be a problem with Publish() running at this point.

nikitasadok commented 3 years ago

I have had a very similar issue lately. There isn't anything special about the reproducing, there would just be a random deadlock which would prevent all the other Publish to this channel. It would lock up for at least an hour (didn't check longer time periods, but I am pretty sure that the deadlock would still be there). However, there was still one weird problem. At about the time when the deadlock appeared, I could see the following log in RabbitMQ Container: [error] <...> closing AMQP connection <...>, so I am assuming that some connection problems could have caused this mixup in acks and lead to deadlock. I completely agree with this pull request as I think that semantically Publish has to have another lock since it is just an increment which can be done more efficiently as well as more clearly with atomic operation.

sljeff commented 3 years ago

@michaelklishin @streadway

hj24 commented 3 years ago

same question here. @michaelklishin @streadway