rabbitmq / rabbitmq-cli

Command line tools for RabbitMQ

rabbitmqctl list_unresponsive_queues logged an error and returned exit code of 0 #406

Closed dreiucker closed 4 years ago

dreiucker commented 4 years ago

On RabbitMQ 3.8.2 with Erlang 22.1.8, I executed rabbitmqctl list_unresponsive_queues and got the following response and exit code:

Response:

Listing unresponsive queues for vhost / ...

14:01:25.503 [error] Discarding message {'$gen_call',{<0.26225.10>,#Ref<0.939061058.265289729.105500>},{info,[name]}} from <0.26225.10> to <0.19395.54> in an old incarnation (3) of this node (

Exit code:

echo $?
0

I know that this happens due to mirrored transient queues, which is a big no-no. However, that is not the subject of this issue; the exit code is. We check the exit codes when we run RabbitMQ CLI commands and therefore rely heavily on a correct exit code being returned. We can cope with unreliable usage if the CLI supports us.

michaelklishin commented 4 years ago

The error listed here is not an indication of an unresponsive queue. In fact, in RabbitMQ server code this would be a warning IIRC. I'm not sure how "mirrored transient queues" can be involved. Newly elected replicas will discard commands that were sent to the previous "incarnation" on this node, that's it. It does not matter whether the queue is transient or not.

The command in question exits with an error code if any of the stream values it received from cluster nodes are errors or timed out. Without a set of steps to reproduce this is mere speculation, but it looks like there were none.
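
To make the rule above concrete, here is a minimal sketch in Python. The actual CLI tools are written in Elixir, so this is not their code. The names `is_error` and `exit_code_for`, the tagged-tuple shapes, and the exit code value 1 are all assumptions made for illustration. The sketch models a result stream as a sequence of values and returns a non-zero exit code only when at least one value is an error or a timeout. Under this model, a message that is merely logged to the console is not a stream value and so has no effect on the exit code.

```python
from typing import Any, Iterable

# Hypothetical tags marking a failed stream value; the real CLI uses
# Erlang/Elixir terms, not these Python tuples.
ERROR_TAGS = ("error", "badrpc")


def is_error(value: Any) -> bool:
    """Return True if a stream value represents an error or a timeout."""
    if isinstance(value, tuple) and value and value[0] in ERROR_TAGS:
        return True
    return value == "timeout"


def exit_code_for(stream: Iterable[Any]) -> int:
    """Return 0 if no stream value is an error or timeout, else 1.

    The value 1 is an arbitrary non-zero code chosen for this sketch.
    """
    return 1 if any(is_error(value) for value in stream) else 0


# A stream with only regular rows (or no rows at all) is a success.
ok_stream = [{"name": "orders"}, {"name": "invoices"}]
# A stream where one node returned an error is a failure.
failed_stream = [{"name": "orders"}, ("error", "nodedown")]

print(exit_code_for(ok_stream))
print(exit_code_for(failed_stream))
```

An empty stream, as in the output reported in this issue where no queues were listed, counts as success in this sketch.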

michaelklishin commented 4 years ago

Here is the stream module that ultimately decides on whether a result stream contained errors and its test suite.

Here are two key RabbitMQ server functions that decide what is listed as "unresponsive". The answer is: a queue must have terminated due to an unhandled exception or have the state of down. Anything else is considered to be responsive. Checking every queue for its complete state can be prohibitively expensive, so CLI tools depend on the previously "computed" queue process state. The management UI follows similar logic.
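
The classification described above can be sketched roughly as follows, again in Python rather than the server's Erlang. `QueueRecord`, the state strings ("crashed", "down", "running", "syncing") and `list_unresponsive` are hypothetical names used only for this illustration and do not come from the server code. The point is that the decision is a cheap lookup on an already recorded state, not a live probe of each queue process.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class QueueRecord:
    """A queue with its previously computed state (hypothetical shape)."""
    name: str
    state: str  # e.g. "running", "syncing", "crashed", "down"


# Assumed labels for "terminated due to an unhandled exception" and "down".
UNRESPONSIVE_STATES = {"crashed", "down"}


def list_unresponsive(queues: Iterable[QueueRecord]) -> List[str]:
    """Return names of queues whose recorded state marks them unresponsive.

    No queue process is contacted; only the stored state is inspected.
    """
    return [q.name for q in queues if q.state in UNRESPONSIVE_STATES]


queues = [
    QueueRecord("orders", "running"),
    QueueRecord("invoices", "syncing"),  # any other state counts as responsive
    QueueRecord("audit", "crashed"),
    QueueRecord("legacy", "down"),
]

print(list_unresponsive(queues))
```

In this sketch a queue in any state other than the two listed, including one that is temporarily busy, is not reported.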

The logged message does not lead to or indicate a queue process failure, merely that the process was recently re-elected/promoted and had to discard some pending commands sent to its previous "incarnation". A queue process in such a scenario can be performing eager synchronisation and thus be unavailable to clients for a period of time. This is not a permanent condition.

Use quorum queues if you want this "sync window" to be as short as possible.