rabbitmq / rabbitmq-stream-dotnet-client

RabbitMQ client for the stream protocol
https://rabbitmq.github.io/rabbitmq-stream-dotnet-client/stable/htmlsingle/index.html

No Connection Resilience when missing too many heartbeats #393

Open bastl98 opened 1 week ago

bastl98 commented 1 week ago

Describe the bug

Currently, the stream connection is not resilient when too many heartbeats are missed and there is a timeout on the connection close. All publishes of this connection result in a client timeout error and all consumers of this connection stop working.

There is also no reconnection attempt, because this error is processed as a "normal" connection close. A manual reconnection attempt is not possible either, because all properties that could be used to check whether a reconnection is needed still indicate that the connection is available (screenshot attached in the original issue).

Reproduction steps

  1. Start an RMQ cluster with 3 nodes (Docker).
  2. Create a stream system and a consumer with the client lib, connected to one of the nodes (lower the heartbeat interval for testing; screenshot attached in the original issue).
  3. Pause the node to which the consumer is connected.
  4. Wait for 4 heartbeats:
     1. The client lib produces the errors shown in the screenshot attached to the original issue.
     2. The consumer stops working, but the method .IsOpen() still returns true.

Expected behavior

The timeout error should not result in a "normal" connection close; it should lead to a reconnection attempt by the library itself.
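
The expected behavior can be illustrated with a minimal sketch. It is written in Python as a language-neutral model and is not the library's C# code: `HeartbeatMonitor`, `CloseReason`, `should_reconnect`, and the `max_missed` threshold are hypothetical names introduced only for this illustration. The idea is that a close caused by missed heartbeats is reported as unexpected, and only an unexpected close triggers a reconnection attempt.

```python
from enum import Enum
from typing import Callable


class CloseReason(Enum):
    NORMAL = "normal"          # the application asked for the close
    UNEXPECTED = "unexpected"  # the connection was lost


class HeartbeatMonitor:
    """Counts consecutive heartbeat intervals in which no frame arrived."""

    def __init__(self, max_missed: int, on_closed: Callable[[CloseReason], None]):
        self.max_missed = max_missed  # assumed threshold, e.g. 4
        self.on_closed = on_closed
        self.missed = 0
        self.is_open = True

    def frame_received(self) -> None:
        # Any frame from the server proves the peer is alive.
        self.missed = 0

    def tick(self) -> None:
        # Called once per heartbeat interval.
        if not self.is_open:
            return
        self.missed += 1
        if self.missed >= self.max_missed:
            self.is_open = False
            self.on_closed(CloseReason.UNEXPECTED)

    def close(self) -> None:
        # Explicit close requested by the application.
        if self.is_open:
            self.is_open = False
            self.on_closed(CloseReason.NORMAL)


def should_reconnect(reason: CloseReason) -> bool:
    # Only an unexpected close should start a reconnection attempt.
    return reason is CloseReason.UNEXPECTED
```

In this model, the state reported to the caller (`is_open`) changes in the same step that reports the close reason, so the two cannot disagree.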

Additional context

Our RMQ Cluster has version 3.13.0 and our client lib has version 1.8.2

lukebakken commented 1 week ago

Hello, thanks for using RabbitMQ and this library.

So we don't have to guess, could you please provide a git repository with code to reproduce this issue? If one of the example projects will work, please let us know.

Have you tried this example code? https://github.com/rabbitmq/rabbitmq-stream-dotnet-client/tree/main/docs/ReliableClient

bastl98 commented 1 week ago

I have adapted the BestPracticesClient for reproduction purposes (logging in the consumer callback and a lowered heartbeat timespan). I have attached the BestPracticesClient and the appsettings.json that I used to reproduce the error.

Here's the link to the repo: https://github.com/bastl98/rmq-bug-source.git

Steps:

  1. Start the client and wait until the consumer starts consuming
  2. Pause the RMQ container the consumer is connected to and wait until the heartbeat threshold is reached
  3. The connection is closed as "normal" because of a timeout in the connection close procedure
  4. When restarting the container, the consumer and the connection will not recover, but the consumer is still reported as open

Gsantomaggio commented 1 week ago

@bastl98 Thank you for reporting the issue.

The library is working properly. It is precisely the scope for the heartbeat to close the client when the client does not receive the "alive" from the server.

The problem here is that the Consumer reports IsOpen == true even though it is closed; it should be marked as closed. The correct status is IsOpen == false

*EDIT See: https://github.com/rabbitmq/rabbitmq-stream-dotnet-client/issues/393#issuecomment-2419890934

bastl98 commented 1 week ago

Will the consumer handling in this case be fixed in the foreseeable future?

Gsantomaggio commented 6 days ago

Will the consumer handling in this case be fixed in the foreseeable future?

What do you mean? In this case, the consumer will be closed.

*EDIT See: https://github.com/rabbitmq/rabbitmq-stream-dotnet-client/issues/393#issuecomment-2419890934

Gsantomaggio commented 3 days ago

@bastl98 OK, you were right. A heartbeat timeout should be treated as an Unexpected close, so the client should try to reconnect.

bastl98 commented 2 hours ago

@Gsantomaggio Unfortunately, the error is not resolved. I think the problem is now in the Dispose method of the connection.

When the heartbeats are missed, the Close method of the client is called, and this method sets the close status of the connection to unexpected.

But the Dispose method of the connection, which is always called at the end of the Close method, sets the close reason back to normal.

[Screenshot 2024-10-21 051124]
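
The overwrite described above can be modeled in a few lines. This is a hedged sketch in Python, not the library's C# source: `BuggyConnection`, `GuardedConnection`, and `CloseReason` are hypothetical names, and the guard shown is only one possible way to keep the first recorded reason, not necessarily the fix the maintainers would choose.

```python
from enum import Enum
from typing import Optional


class CloseReason(Enum):
    NORMAL = "normal"
    UNEXPECTED = "unexpected"


class BuggyConnection:
    """Model of the reported behavior: dispose() always resets the reason."""

    def __init__(self) -> None:
        self.close_reason: Optional[CloseReason] = None

    def close(self, reason: CloseReason) -> None:
        self.close_reason = reason
        self.dispose()  # dispose runs at the end of every close

    def dispose(self) -> None:
        # Unconditional assignment: an UNEXPECTED reason set by close() is lost.
        self.close_reason = CloseReason.NORMAL


class GuardedConnection(BuggyConnection):
    """One possible guard: the first recorded reason wins."""

    def dispose(self) -> None:
        # Only fall back to NORMAL when nothing was recorded before.
        if self.close_reason is None:
            self.close_reason = CloseReason.NORMAL
```

With the unconditional assignment, a close triggered by missed heartbeats ends up looking like a normal close, which matches the missing reconnection attempt reported in this issue. With the guard, the unexpected reason survives `dispose()`, while a plain `dispose()` without a prior close is still recorded as normal.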