rabbitmq / rabbitmq-dotnet-client

RabbitMQ .NET client for .NET Standard 2.0+ and .NET 4.6.2+
https://www.rabbitmq.com/dotnet.html
Other
2.07k stars 579 forks source link

'Leak' of channels (IModel instances) in case where connection is blocked #1573

Open lukebakken opened 4 months ago

lukebakken commented 4 months ago

Discussed in https://github.com/rabbitmq/rabbitmq-dotnet-client/discussions/1131

Originally posted by **neilgreatorex** December 24, 2021 Hi, I think I've come across an issue whereby channels can be 'leaked' during a connection that has been blocked by the broker due to resource constraints. The issue occurs if you try to create new channels while the connection is blocked. The calls to CreateModel() throw a System.TimeoutException and you (quite reasonably) don't get a reference to the channel. The problem is that once the connection is unblocked, suddenly the channels that you tried to create appear on the broker end and are not cleaned up by any garbage collection in the client. Looking at the heap at this point, the problem seems to be that the sessions created in the SessionManager are not cleaned up. As far as I can see, there's no way for a user of the client library to access the SessionManager via public API. I've added my sample code to [this gist](https://gist.github.com/neilgreatorex/746c1d9af360ad4bcc5272ff6e645f6d). In the gist is also the results that I get on the console when running the code against a resource constrained (blocked) broker (along with the rabbitmqctl commands that were run and their output). Please let me know if you want me to raise an issue here on GitHub for this, or if I should post this elsewhere. Thanks, Neil PS. Merry Christmas to those who celebrate 😃🎅
michaelklishin commented 4 months ago

In the Java client, the behavior of a Connection#createChannel call is not to return unless a channel.open-ok frame is received. When the connection is blocked by a resource alarm, that will not happen until the operation times out or the alarm clears. So I guess that avoids the leak.

In Bunny, in addition to the behavior above, there's a predicate on Bunny::Connection that returns true when the connection is blocked.

If the Java client option would be too difficult or impractical to pull off, the Bunny solution plus throwing an exception might be a good enough approximation.

lukebakken commented 4 months ago

Thanks @michaelklishin. I moved this to an issue to try to reproduce it.

neilgreatorex commented 4 months ago

I was the original author of the discussion post. I did try this again recently and the behaviour still seems to be there.

lukebakken commented 4 months ago

Hi @neilgreatorex, thanks for following up.

I took you code and added it to a test to demonstrate that this issue is fixed in the latest code - https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/1587

Note that the TimeoutException when creating channels on a blocked connection does not happen anymore. I'm not sure why to be honest but it's good enough for me.

neilgreatorex commented 4 months ago

Hi @lukebakken. Unfortunately I can still reproduce the issue using the code in master. I have shared my test class at https://gist.github.com/neilgreatorex/63e163c933ca8b2836b987eeac85a7b4.

To test:

  1. Startup a broker that is resource constrained (I used rabbitmqctl set_vm_memory_high_watermark absolute 10).
  2. Run the class I linked, following the console instructions.
  3. After clearing the resource constraint 5 channels appear on the broker (seen with rabbitmqctl list_channels) that the client has no reference to.
lukebakken commented 3 months ago

@neilgreatorex do you get TimeoutException thrown when you create the five channels in your env?

neilgreatorex commented 3 months ago

@neilgreatorex do you get TimeoutException thrown when you create the five channels in your env?

@lukebakken I get TaskCancelledExceptions now with the async methods

lukebakken commented 3 months ago

Thanks for the updated code, I had a bug in my test. I can reproduce this now.

lukebakken commented 3 months ago

Here is what is happening - when those CreateChannelAsync methods time out (due to not receiving a response from RabbitMQ), they should NOT be considered valid and added to the channelTasks array, even though the channels are considered open in RabbitMQ.

When the connection is unblocked, RabbitMQ sends the channel.open-ok response, but the task continuation already timed out (TaskCanceledException), so the client library can't do anything with them. The channels are "zombies" within RabbitMQ that will be closed when the connection closes.

Maaybe the next version of this library can keep track of "stale, timed-out" continuations for some time (perhaps when the connection is blocked) to see if a response is sent back, but version 7 will not do that.

cc @michaelklishin @Gsantomaggio I thought you'd find this case interesting.

neilgreatorex commented 3 months ago

@lukebakken Would it be possible that if the library received a channel.open-ok for a channel ID that it wasn't expecting (i.e. has no continuation for), it could immediately close the channel? This is just a suggestion without knowledge of how it all works and maybe it's not helpful. I don't even know if the library can tell if the channel ID is valid or not 😃

lukebakken commented 3 months ago

@neilgreatorex that's a good idea.