rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

In production, nightly restarts of two nodes have led to frequent errors such as "operation queue.declare caused a channel exception not_found". #11617

Closed · MrQiudaoyu closed this issue 4 days ago

MrQiudaoyu commented 4 days ago

Describe the bug

- Versions used: RabbitMQ 3.6.10, Erlang 19.3.6.13
- Configuration: '{"ha-mode":"all"}'
- Number of cluster nodes: 24
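For context, an "ha-mode: all" policy of this kind is normally applied with rabbitmqctl set_policy; the policy name and queue pattern below are assumptions rather than values taken from this cluster:

```bash
# Illustrative only: how an "ha-mode: all" mirroring policy is typically applied.
# The policy name "ha-all" and the match-everything pattern "^" are assumptions.
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'
```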

Background: After the mirrored cluster had been running in production for some time, certain nodes repeatedly exceeded the high memory watermark, causing service disruptions in the cluster. Initially we restarted those nodes manually to bring memory back below the watermark (longer term, we planned to address this by reducing the number of mirrored queues). After those manual restarts, no other issues occurred.

Subsequently, we implemented a restart strategy: a scheduled script runs the stop and start commands in sequence. Each night two nodes are restarted; each node takes approximately 40 minutes to start, and the interval between restarting the two nodes is 90 minutes. The script's main content is as follows:

(screenshot of the restart script; not reproduced here)
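As a hedged sketch (the real script is only available as a screenshot, so the host names, the use of ssh, and the start command below are assumptions), the nightly restart described above might look roughly like this:

```bash
#!/bin/bash
# Rough sketch only: the actual production script is not reproduced in this issue.
# Host names, ssh usage, and the systemctl start command are assumptions.

NODES=("node01" "node02")   # the two hosts restarted on a given night (hypothetical)

for host in "${NODES[@]}"; do
  # Stop the RabbitMQ node (broker and Erlang VM)
  ssh "$host" "rabbitmqctl stop"

  # Start it again (the reporter's exact start command is unknown; systemctl is assumed)
  ssh "$host" "systemctl start rabbitmq-server"

  # Each node reportedly takes ~40 minutes to start, and the two restarts are
  # 90 minutes apart, so wait before moving on to the next node.
  sleep 5400
done
```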

After implementing this restart strategy, new issues arose in production:

Issues Observed:

  1. During startup of the restarted nodes, the logs showed messages such as "Stopping all nodes on master shutdown since no synchronised slave is available." This indicates that the master queue had not completed synchronization with all of its slaves.

  2. Subsequently, frequent error messages started appearing in the logs.

    (screenshot of the log errors, e.g. "operation queue.declare caused a channel exception not_found")
  3. For the queues that trigger this error, running the rabbitmqctl list_queues command on the server does not list them, yet they still appear in the management interface with all of their details empty (see the checks sketched after this list).
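As a rough illustration of checking the symptoms in items 1 and 3, commands along these lines can be run on a cluster node; the queue name and vhost are hypothetical placeholders for the affected queues:

```bash
# Hypothetical checks; "my.queue" and the default vhost "/" are placeholders.

# Does the queue still exist according to the CLI?
rabbitmqctl list_queues -p / name state | grep my.queue

# Which mirrors does each queue have, and which of them are synchronised?
rabbitmqctl list_queues -p / name pid slave_pids synchronised_slave_pids
```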

Reproduction steps

Using the same configuration in a development environment, we simulated the production message publishing and consumption rates, then restarted all 24 nodes one after another; the same errors and symptoms occurred.

Expected behavior

Our analysis suggests that the issues arising after these restarts are caused by the master queue not completing synchronization, which leaves RabbitMQ's internal data inconsistent. However, we are unclear about the exact trigger: why did the problem never occur when we restarted individual nodes manually, yet it appears when two nodes are restarted sequentially by a script with an interval between them? This is quite perplexing, and we hope the RabbitMQ team can provide an explanation. Thank you very much!

Additional context

No response

mkuratczyk commented 4 days ago

RabbitMQ 3.6 has been out of support for many years. We are not going to investigate anything about 3.6.

Mirrored queues have been deprecated for years and are already removed on the main branch. Reducing their number is a good first step, but you should get rid of them altogether soon.
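As an illustration of that advice, narrowing and eventually removing the classic-mirroring policy could look like the following; the policy name "ha-all" and the pattern are assumptions about this cluster's setup, not details taken from the issue:

```bash
# Sketch only; "ha-all" and the "^" pattern are assumed, not confirmed by the reporter.

# First step: mirror each queue to a small fixed number of nodes instead of all 24
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'

# Eventually: drop classic mirroring entirely once the queues no longer need it
rabbitmqctl clear_policy ha-all
```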