rabbitmq / rabbitmq-stream-go-client

A client library for RabbitMQ streams

HA Producer fails to publish after reconnection #351

Closed: hiimjako closed this issue 1 month ago

hiimjako commented 1 month ago

Describe the bug

Under certain conditions, the HA producer fails to reconnect and/or publish to the RabbitMQ server. In my case, the issue seems to occur when the RabbitMQ instance restarts after being moved to a different k8s node by automatic optimization tools such as Karpenter, though it does not happen every time.

Client v1.4.9 logs: (error around line 110) c37-logs-2024-09-20 15_35_27.txt

RabbitMQ v3.13.6 (same error even on v3.13.7) server logs: (error around line 249) rabbitMQ-logs-2024-09-20 15_31_22.txt

From the logs, it appears that after disconnection the client attempts to reconnect and then logs the following:

[error] - timeout 10000 ns - waiting Code, operation: Command not handled 5
[info] - [Reliable] - The stream producer c37-118-server-fc7bff59c-q5sbz for stream stat exists. stat reconnected.

It seems the client mistakenly believes it has reconnected, even though it has not. Each time I experience this reconnection problem, the log Command not handled 5 appears, which may be related.

After this "reconnection", attempts to publish messages result in:

Producer BatchSend error during flush: write tcp 10.2.83.77:60800->10.2.83.73:5552: write: broken pipe

This is the same error returned when trying to publish a message while disconnected. Following this, I see numerous logs stating:

error during send producer id: 3  closed
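
To tell this failure apart from other send errors, an application could check whether a returned error is a write on a dead TCP connection. The following is a minimal sketch using only the Go standard library. `isBrokenConnection` and `simulatedBrokenPipe` are hypothetical names, and it is an assumption that the client returns the underlying network error in a form that `errors.Is` can unwrap; the logs above only show the message text.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isBrokenConnection reports whether err looks like an operation on a dead
// TCP connection. Hypothetical helper, not part of the client library.
func isBrokenConnection(err error) bool {
	return errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, net.ErrClosed)
}

// simulatedBrokenPipe builds an error shaped like the one the net package
// returns when writing to a broken connection (for demonstration only).
func simulatedBrokenPipe() error {
	return &net.OpError{
		Op:  "write",
		Net: "tcp",
		Err: os.NewSyscallError("write", syscall.EPIPE),
	}
}

func main() {
	err := simulatedBrokenPipe()
	fmt.Println(err, "-> broken:", isBrokenConnection(err))
	fmt.Println("nil -> broken:", isBrokenConnection(nil))
}
```

If the error only reaches the application as a log line or a flattened string, this check cannot match it, and a health signal based on publish outcomes is the safer option.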

A significant concern is that after this failed reconnection, the client believes it is connected, so the call to producer.GetStatus() returns "open", leading my health checks to indicate that the pod is healthy when it is not. This situation results in the pod remaining active and requires manual intervention.
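
As a workaround for the misleading status, a health check could rely on observed publish outcomes instead of the status reported by the producer. The following is a minimal sketch, not the library's API: `PublishHealth`, `Record`, `Healthy`, and the failure threshold are all hypothetical, and it assumes the application can observe each send result (for example, the error returned by a send call or a failed confirmation).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSimulated stands in for a real publish error in this demonstration.
var errSimulated = errors.New("simulated publish failure")

// PublishHealth counts consecutive publish failures. Hypothetical type,
// not part of the client library.
type PublishHealth struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
}

// NewPublishHealth returns a tracker that turns unhealthy once maxFailures
// consecutive publish attempts have failed.
func NewPublishHealth(maxFailures int) *PublishHealth {
	return &PublishHealth{maxFailures: maxFailures}
}

// Record registers the outcome of one publish attempt. A nil error resets
// the failure counter; a non-nil error increments it.
func (h *PublishHealth) Record(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if err == nil {
		h.failures = 0
		return
	}
	h.failures++
}

// Healthy reports whether fewer than maxFailures consecutive failures
// have been recorded.
func (h *PublishHealth) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.failures < h.maxFailures
}

func main() {
	h := NewPublishHealth(3)
	for i := 0; i < 3; i++ {
		h.Record(errSimulated)
	}
	fmt.Println("healthy after 3 failures:", h.Healthy())

	h.Record(nil)
	fmt.Println("healthy after a success:", h.Healthy())
}
```

A liveness endpoint that returns a failure when `Healthy()` is false would let Kubernetes restart the pod without manual intervention. Counting consecutive failures rather than single errors avoids restarting on a transient error that the client recovers from.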

Reproduction steps

Following these steps seems to consistently reproduce the issue:

  1. Start a RabbitMQ server, the client, and the RabbitMQ messaging operator, with a stream and a user that has r/w permissions on it (used by the client), all in the default virtual host
  2. Stop the RabbitMQ server
  3. Wait until the RabbitMQ server is fully stopped
  4. Restart the RabbitMQ messaging operator
  5. Start the RabbitMQ server again
  6. The client should now return the error

It seems that the interaction between the operators and RabbitMQ leads to the error, but this is just a guess.

Expected behavior

I expect the client to reconnect reliably, or at least to signal that it is not correctly connected.

Additional context

The service and RabbitMQ are deployed in a K8s cluster on AWS and managed by the Cluster Operator and the Messaging Topology Operator. The RabbitMQ instance is accessed via a hostname that is routed through a ClusterIP.

hiimjako commented 1 month ago

PR #352 seems to solve the problem, so for me the issue can be closed.