Open · houseofcat opened this issue 4 years ago
After playing around with it some more...
The length of the confirmation chan prior to the select is 0, and the chan is in fact closed. So it never needs to be nil'd out
and remade... it's literally empty until it hits the select. I have no data race, nothing bizarre going on... so I just tried a return statement instead.
counter := 0

FlushLoop:
	for {
		if ch.connHost.Connection.IsClosed() {
			return
		}

		if len(ch.Confirmations) == 100 {
			ch.Confirmations = nil
			ch.Confirmations = make(chan amqp.Confirmation, 100)
			ch.Channel.NotifyPublish(ch.Confirmations)
		}

		select {
		case confirmation := <-ch.Confirmations: // Some weird use case where the Channel is being flooded with confirms after connection disrupt
			counter++
			if counter == 10 {
				fmt.Printf("ChannelID: %d - confirmations flooded (confirmation deliverytag: %d) - initiating bypass!\r\n", ch.ID, confirmation.DeliveryTag)
				break FlushLoop
			}
		default:
			return
		}
	}
This just works, simply. I don't know if I have a memory leak, but it doesn't appear so; it isn't a race condition, and it isn't a goroutine leak either. I'm happily able to use the ChannelPool for PublishWithConfirmations again.
// FlushConfirms removes all previous confirmations pending processing.
func (ch *ChannelHost) FlushConfirms() {
	ch.chanLock.Lock()
	defer ch.chanLock.Unlock()

	for {
		if ch.connHost.Connection.IsClosed() {
			return
		}

		select {
		case <-ch.Confirmations: // Some weird use case where the Channel is being flooded with confirms after connection disrupt
			return
		default:
			return
		}
	}
}
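For what it's worth, a receive from a closed Go channel never blocks and yields the zero value, which would look exactly like an endless stream of DeliveryTag-0 confirmations. A minimal drain sketch using the comma-ok receive to tell a stale confirmation apart from a closed chan (plain Go against the streadway/amqp types, not code from the repo; drainConfirms is a made-up name):

package main

import "github.com/streadway/amqp"

// drainConfirms empties a buffered confirmation chan without blocking.
// It returns false when the chan has been closed by the library (for
// example after the underlying amqp.Connection was severed); at that
// point every further receive would just yield a zero-value Confirmation.
func drainConfirms(confirms chan amqp.Confirmation) bool {
	for {
		select {
		case conf, ok := <-confirms:
			if !ok {
				return false // closed: re-register NotifyPublish on a fresh channel
			}
			_ = conf // stale confirmation from before the disruption; discard it
		default:
			return true // buffer empty and the chan is still open
		}
	}
}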
When attempting to use a ChannelPool of ackable channels with a Confirmation chan hosted inside a wrapper struct (ChannelHost), I noticed that the Confirmation chan exploded with Confirmations, all with identical DeliveryTags (usually 0), no matter the size of the chan buffer, during an amqp.Connection disconnect. The close of the chan used by amqp.Channel for NotifyPublish was also initiated from the amqp library. Which is fine, except I became deadlocked trying to get these confirmations out of the chan once its buffer was maxed out. I have run into a pain point where I would have loved to be able to add a listener and then remove that listener manually, so I could increase the longevity of pooled amqp.Channels.
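For reference, this is roughly how that confirmation chan gets wired up through the streadway/amqp API (a hedged sketch of the usual pattern, not the ChannelHost code; setupConfirms is a hypothetical helper name):

package main

import "github.com/streadway/amqp"

// setupConfirms puts the channel into confirm mode, then hands NotifyPublish
// a buffered chan. The library owns that chan and closes it when the
// amqp.Channel (or its amqp.Connection) dies, which is exactly when the
// zero-value flood described above can appear to a reader.
func setupConfirms(ch *amqp.Channel, buffer int) (chan amqp.Confirmation, error) {
	if err := ch.Confirm(false); err != nil { // noWait=false: wait for confirm.select-ok
		return nil, err
	}
	return ch.NotifyPublish(make(chan amqp.Confirmation, buffer)), nil
}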
What triggered the connection closure was a little chaos engineering. Imagine a pool of amqp.Channels ready to be used with a PublishWithConfirmation setup; let it run for a few seconds, then sever all amqp.Connections manually with:

rabbitmqctl.bat close_all_connections "suck it long, suck it hard trebek"
Now on recovery, we obviously reconstructed the amqp.Channel, assigned it a new Confirmation chan, and proceeded to use it, assuming the amqp.Connection recovery was successful.

Now, because we are doing a confirmation and I don't allow individual instances of amqp.Channel to be used in parallel, it is safe for me to flush all preceding confirmations that might be sitting in the buffer and then do the next publish synchronously by waiting for the very next confirmation. So I created this FlushConfirms(), which drained the confirmation buffer. This was working until my chaos engineering, so the issue only arises after severing the main amqp.Connection.
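A rough sketch of that flush-then-wait publish pattern, assuming a single goroutine owns the amqp.Channel and the confirmation buffer was already flushed (illustrative only, not the turbocookedrabbit implementation; the 5-second timeout is arbitrary):

package main

import (
	"errors"
	"time"

	"github.com/streadway/amqp"
)

// publishAndWait publishes one message and blocks for the very next
// confirmation, which must belong to this publish because the channel is
// not shared and its confirmation buffer was drained beforehand.
func publishAndWait(ch *amqp.Channel, confirms <-chan amqp.Confirmation, exchange, key string, body []byte) error {
	err := ch.Publish(exchange, key, false, false, amqp.Publishing{
		ContentType: "application/octet-stream",
		Body:        body,
	})
	if err != nil {
		return err
	}

	select {
	case conf, ok := <-confirms:
		if !ok {
			return errors.New("confirmation chan closed: channel or connection was lost")
		}
		if !conf.Ack {
			return errors.New("broker nacked the publish")
		}
		return nil
	case <-time.After(5 * time.Second): // arbitrary timeout for the sketch
		return errors.New("timed out waiting for the publish confirmation")
	}
}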
Code that was called after the amqp.Connection was fully recovered and we need the amqp.Channel back in the Host.

This FlushConfirms() produces the following output, for example! So I had to receive at least 10 messages, printing out the last of the 10, and you can see it was DeliveryTag 0, so it's getting reused even though that isn't supposed to be happening. The interesting thing here is that only one channel was used for this publish! Each amqp.Channel gets a unique chan, so everything should in theory be multi-thread safe. I am not sure how this is happening! I did confirm it is the exact same value and tag.
And finally, the Publish.

I have solved it by switching to transient channels on a per-publish basis (with confirmations). I know that's not super ideal, just that it is working.
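That transient-channel workaround looks roughly like this (a sketch of the approach under the streadway/amqp API, not the repo's code): open a throwaway channel per confirmed publish so a severed connection can never leave a stale confirmation chan behind.

package main

import (
	"errors"

	"github.com/streadway/amqp"
)

// publishTransient opens a fresh amqp.Channel, publishes once with
// confirms enabled, waits for that single confirmation, and closes the
// channel. Heavier than pooling, but there is nothing to flush or recover.
func publishTransient(conn *amqp.Connection, exchange, key string, body []byte) error {
	ch, err := conn.Channel()
	if err != nil {
		return err
	}
	defer ch.Close()

	if err := ch.Confirm(false); err != nil {
		return err
	}
	confirms := ch.NotifyPublish(make(chan amqp.Confirmation, 1))

	if err := ch.Publish(exchange, key, false, false, amqp.Publishing{Body: body}); err != nil {
		return err
	}

	// Blocks until the broker confirms, nacks, or the chan is closed.
	conf, ok := <-confirms
	if !ok {
		return errors.New("confirmation chan closed before a confirm arrived")
	}
	if !conf.Ack {
		return errors.New("broker nacked the publish")
	}
	return nil
}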
Code is located in this Repo: https://github.com/houseofcat/turbocookedrabbit
PubSub test to easily reproduce said scenario is here: https://github.com/houseofcat/turbocookedrabbit/blob/727874ecaf4548be5dcec3327178e0318460bb9f/tests/main_pubsub_test.go#L179
This may very well be an issue with my understanding of the confirmation process, but I did at least confirm that strategy of transient channel usage via the pub/sub examples in this repo. It feels like it's a continuous stream of broadcasts until success. The problem with that is that the Connection died, the Channel was recreated, and this chan is also closed. So it's just really weird.

go version go1.14.4 windows/amd64