Occurs error "ra: failed to form a new cluster nova_conductor" when using RabbitMQ cluster

Describe the bug

Hi guys: I'm using RabbitMQ as OpenStack message broker. Yesterday, I noticed that nova-conductor component continuously raise "Cannot declare a queue conductor: cluster_not_formed" in logs, and RabbitMQ cluster also raised the same error logs.

The RabbitMQ version is 3.8.29 and Erlang version is 24.3.3. They are running in k8s cluster with container image "docker.io/library/rabbitmq:3.8.29-management" and formed with 3 nodes. I use RabbitMQ several years and it is first time encountering this error.

The nodes has been running for 10 days. During the days, the nodes executed some network fault tests 5 times that set network interface state to down and lift up it 3s after. Only conductor queue has this error and the other queues are all normal. I think it is the main reason for this error and it is a accidental small probability event.

Main nova-conductor error logs are as follows:

2024-01-19 10:26:30.766 28 ERROR oslo.messaging._drivers.impl_rabbit [-] Internal amqp error (541) during queue declare,retrying in 2 seconds. Queue: [conductor], error message: [Queue.declare: (541) INTERNAL_ERROR - Cannot declare a queue 'queue 'conductor' in vhost 'nova'' on node 'rabbit@openstack-rabbitmq-rabbitmq-1.rabbitmq.openstack.svc.cluster.local': cluster_not_formed]: amqp.exceptions.InternalError: Queue.declare: (541) INTERNAL_ERROR - Cannot declare a queue 'queue 'conductor' in vhost 'nova'' on node 'rabbit@openstack-rabbitmq-rabbitmq-1.rabbitmq.openstack.svc.cluster.local': cluster_not_formedESC[00m
2024-01-19 10:26:32.181 19 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: connection already closed: kombu.exceptions.OperationalError: connection already closedESC[00m

Main RabbitMQ error logs are as follows:

2024-01-19 02:49:24.887 [info] <0.18183.1319> quorum_queue declaring {amqqueue,{resource,<<110,111,118,97>>,queue,<<99,111,110,100,117,99,116,111,114>>},true,false,none,[{<<120,45,113,117,101,117,101,45,116,121,112,101>>,longstr,<<113,117,111,114,117,109>>}],none,[],[],[],undefined,undefined,[],[],live,0,[],<<110,111,118,97>>,#{user => <<110,111,118,97>>},rabbit_quorum_queue,#{}}
2024-01-19 02:49:24.890 [error] <0.18042.1319> Supervisor {<0.18042.1319>,ra_server_sup} had child nova_conductor started with ra_server_proc:start_link(#{cluster_name => nova_conductor,friendly_name => "queue 'conductor' in vhost 'nova'",id => {nova_conductor,...},...}) at undefined exit with reason {already_started,<0.25555.619>} in context start_error
2024-01-19 02:49:24.890 [error] <0.18275.1319> ra: failed to start a server #{cluster_name => nova_conductor,friendly_name => [113,117,101,117,101,32,39,99,111,110,100,117,99,116,111,114,39,32,105,110,32,118,104,111,115,116,32,39,110,111,118,97,39],id => {nova_conductor,'rabbit@openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local'},initial_members => [{nova_conductor,'rabbit@openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local'},{nova_conductor,'rabbit@openstack-rabbitmq-rabbitmq-2.rabbitmq.openstack.svc.cluster.local'},{nova_conductor,'rabbit@openstack-rabbitmq-rabbitmq-1.rabbitmq.openstack.svc.cluster.local'}],log_init_args => #{uid => <<78,79,86,65,95,67,73,55,57,82,77,57,48,83,87,57,82,55>>},machine => {module,rabbit_fifo,#{become_leader_handler => {rabbit_quorum_queue,become_leader,[{resource,<<110,111,118,97>>,queue,<<99,111,110,100,117,99,116,111,114>>}]},created => 1705632564889,dead_letter_handler => {rabbit_quorum_queue,dead_letter_publish,[undefined,undefined,{resource,<<110,111,118,97>>,queue,<<99,111,110,100,117,99,116,111,114>>}]},delivery_limit => undefined,expires => undefined,max_bytes => undefined,max_in_memory_bytes => undefined,max_in_memory_length => undefined,max_length => undefined,name => nova_conductor,overflow_strategy => drop_head,queue_resource => {resource,<<110,111,118,97>>,queue,<<99,111,110,100,117,99,116,111,114>>},single_active_consumer_on => false}},metrics_key => {resource,<<110,111,118,97>>,queue,<<99,111,110,100,117,99,116,111,114>>},tick_timeout => 5000,uid => <<78,79,86,65,95,67,73,55,57,82,77,57,48,83,87,57,82,55>>}, error: {error,{shutdown,{failed_to_start_child,nova_conductor,{already_started,<0.25555.619>}}}}
2024-01-19 02:49:24.891 [error] <0.18183.1319> ra: failed to form a new cluster nova_conductor.
 No servers were succesfully started.
 operation queue.declare caused a connection exception internal_error: "Cannot declare a queue 'queue 'conductor' in vhost 'nova'' on node 'rabbit@openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local': cluster_not_formed"
2024-01-19 02:49:24.893 [error] <0.18290.1319> Error on AMQP connection <0.18290.1319> (10.244.6.109:58322 -> 10.244.1.54:5672 - nova-conductor:28:3c686902-0ce1-4a19-a460-0824e957f03e, vhost: 'nova', user: 'nova', state: running), channel 1:
2024-01-19 02:49:24.894 [info] <0.18290.1319> closing AMQP connection <0.18290.1319> (10.244.6.109:58322 -> 10.244.1.54:5672 - nova-conductor:28:3c686902-0ce1-4a19-a460-0824e957f03e, vhost: 'nova', user: 'nova')
2024-01-19 02:49:24.895 [info] <0.18119.1319> Closing all channels from connection '10.244.6.109:58322 -> 10.244.1.54:5672' because it has been closed
2024-01-19 02:49:26.888 [error] emulator Error in process <0.18821.1319> on node 'rabbit@openstack-rabbitmq-rabbitmq-0.rabbitmq.openstack.svc.cluster.local' with exit value:
{{badmatch,{error,not_found}},[{rabbit_quorum_queue,repair_leader_record,2,[{file,"src/rabbit_quorum_queue.erl"},{line,398}]},{rabbit_quorum_queue,'-handle_tick/3-fun-0-',10,[{file,"src/rabbit_quorum_queue.erl"},{line,379}]}]}

Full logs of 3 RabbitMQ cluster nodes:

openstack-rabbitmq-rabbitmq-0_master10716sec.log

openstack-rabbitmq-rabbitmq-1_master10715sec.log

openstack-rabbitmq-rabbitmq-2_master10714sec2.log

Please help me and I'll be very grateful for you.

Reproduction steps

Running RabbitMQ cluster with container image "docker.io/library/rabbitmq:3.8.29-management" in k8s.
Executed some network fault tests 5 times that set one node's network interface state to down and lift up it 3s after.
it is a accidental small probability event, I'm not sure if the issue can be reproduced.

By the way, It is esay to fix the error temporarily by just kill all RabbitMQ pods and then restart them.

Expected behavior

RabbitMQ cluster can work as normal.

Additional context

No response

rabbitmq / ra