Describe the bug
Under certain conditions, the HA producer fails to reconnect and/or publish to the RabbitMQ server. In my specific case, the issue seems to occur when the RabbitMQ instance restarts after its k8s nodes are replaced by automatic optimizations from tools like Karpenter, though it is not deterministic.
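For context, the HA producer is created roughly like this (a simplified sketch of my setup: host, credentials and the confirmation handler are placeholders, and the exact option setters may differ from my real code; the stream name stat is the one that appears in the logs below):

```go
package main

import (
	"log"

	"github.com/rabbitmq/rabbitmq-stream-go-client/pkg/ha"
	"github.com/rabbitmq/rabbitmq-stream-go-client/pkg/stream"
)

func main() {
	// Placeholder connection settings; the real service reads these from
	// configuration and connects via the ClusterIP hostname.
	env, err := stream.NewEnvironment(
		stream.NewEnvironmentOptions().
			SetHost("rabbitmq.rabbitmq.svc.cluster.local").
			SetPort(5552).
			SetUser("client-user").
			SetPassword("secret"))
	if err != nil {
		log.Fatal(err)
	}

	// HA (reliable) producer on the "stat" stream mentioned in the logs below.
	producer, err := ha.NewReliableProducer(env, "stat",
		stream.NewProducerOptions(),
		func(confirms []*stream.ConfirmationStatus) {
			for _, c := range confirms {
				if !c.IsConfirmed() {
					log.Printf("message not confirmed")
				}
			}
		})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// ... periodic publishing and the health check shown further below ...
}
```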
From the logs, it appears that after disconnection, the client attempts to reconnect but encounters a message indicating:
[error] - timeout 10000 ns - waiting Code, operation: Command not handled 5
[info] - [Reliable] - The stream producer c37-118-server-fc7bff59c-q5sbz for stream stat exists. stat reconnected.
It seems the client mistakenly believes it has reconnected, even though it has not. Each time I experience this reconnection problem, the log Command not handled 5 appears, which may be related.
After this "reconnection", attempts to publish messages fail with the same error that is returned when trying to publish a message while disconnected. Following this, I see numerous logs stating:
error during send producer id: 3 closed
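For reference, the publish path in my service is just a thin wrapper around Send; the "error during send ..." lines above come from a log statement like the one in this simplified, illustrative sketch (the wrapper and its log format are my own code, not the library's):

```go
import (
	"log"

	"github.com/rabbitmq/rabbitmq-stream-go-client/pkg/amqp"
	"github.com/rabbitmq/rabbitmq-stream-go-client/pkg/ha"
)

// publish sends one message through the HA producer and logs failures.
// Illustrative sketch of my actual code.
func publish(producer *ha.ReliableProducer, payload []byte) {
	if err := producer.Send(amqp.NewMessage(payload)); err != nil {
		// After the failed "reconnection" this fires for every message,
		// with err being "producer id: N closed".
		log.Printf("error during send %s", err)
	}
}
```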
A significant concern is that after this failed reconnection, the client believes it is connected, so the call to producer.GetStatus() returns "open", leading my health checks to indicate that the pod is healthy when it is not. This situation results in the pod remaining active and requires manual intervention.
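The health check that keeps reporting the pod as healthy boils down to something like this sketch. It is simplified and partly an assumption: my real code maps the status to the string "open" before checking it, and I am quoting the ha.StatusOpen constant name from memory, so it may differ.

```go
import (
	"net/http"

	"github.com/rabbitmq/rabbitmq-stream-go-client/pkg/ha"
)

// livenessHandler reports healthy as long as the HA producer says it is open.
// After the failed reconnection GetStatus() still reports the producer as
// open, so this keeps returning 200 even though every Send fails.
func livenessHandler(producer *ha.ReliableProducer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Assumption: the exported constant is ha.StatusOpen.
		if producer.GetStatus() == ha.StatusOpen {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	}
}
```

If GetStatus() reflected the broken connection here, the liveness probe would restart the pod without manual intervention.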
Reproduction steps
Following these steps seems to consistently reproduce the issue:
Start a RabbitMQ server, the client, and the RabbitMQ Messaging Topology Operator (with a stream, and a user with r/w permissions on it that is used by the client; everything in the default virtual host)
Stop the RabbitMQ server
Wait until the RabbitMQ server is fully stopped
Restart the RabbitMQ messaging operator
Start the RabbitMQ server again
The client should now hit the error described above
It seems that the interaction between the operators and RabbitMQ leads to the error, but this is just a guess.
Expected behavior
I expect the client to reconnect reliably, or at least to signal that it is not correctly connected.
Additional context
The service and RabbitMQ are deployed in a K8s cluster on AWS, managed by the Cluster Operator and the Messaging Topology Operator. The RabbitMQ instance is accessed via a hostname that is routed through a ClusterIP Service.
Client v1.4.9 logs (error around line 110): c37-logs-2024-09-20 15_35_27.txt
RabbitMQ v3.13.6 server logs (same error also on v3.13.7; error around line 249): rabbitMQ-logs-2024-09-20 15_31_22.txt