Open netrmqdev opened 1 week ago
@netrmqdev how large are those "big messages"?
The size varies a lot between a few KB to a few MB, but most messages are less than 1MB.
Another question is: how long should this "300 message/second, large messages" workload run before this issue can be hit.
Last time @lhoguin and @mkuratczyk had to investigate something similar it took a few weeks to find a reasonably reliable way to reproduce. So any details would be appreciated.
I wish I had identified a reproducible pattern, but:
Hello! The first log is probably missing a few lines at the beginning, the line just above what you copied is important.
I've found a similar occurrence where I have access to more logs than the first one (the ingestion of the logs into Grafana makes things a bit messy in this case):
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> trap_exit: true
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> status: running
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> heap_size: 66222786
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> stack_size: 28
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> reductions: 1274977716
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> neighbours:
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> supervisor: {<0.584.0>,rabbit_amqqueue_sup}
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> errorContext: child_terminated
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> reason: {function_clause,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{rabbit_msg_store,reader_pread_parse,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [[eof]],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,692}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_msg_store,reader_pread,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,688}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_msg_store,read_from_disk,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,648}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_msg_store,client_read3,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,640}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_variable_queue,with_msg_store_state,3,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_variable_queue.erl"},{line,1348}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_variable_queue,read_msg,5,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_variable_queue.erl"},{line,1583}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_variable_queue,fetch,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_variable_queue.erl"},{line,616}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_priority_queue,fetch,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_priority_queue.erl"},{line,302}]}]}
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> offender: [{pid,<0.585.0>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {id,rabbit_amqqueue},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {mfargs,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_prequeue,start_link,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{amqqueue,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {resource,<<"/">>,queue,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> <<"QUEUE_NAME">>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> true,false,none,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{<<"x-queue-mode">>,longstr,<<"lazy">>}],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> <0.878.0>,[],[],[],undefined,undefined,[],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> undefined,live,0,[],<<"/">>,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> #{user => <<"rmq-internal">>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> rabbit_classic_queue,#{}},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> recovery,<0.583.0>]}},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {restart_type,transient},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {significant,true},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {shutdown,600000},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {child_type,worker}]
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>
2024-09-18 22:52:26.317526+00:00 [error] <0.93868.0> Restarting crashed queue 'QUEUE_NAME' in vhost '/'
@netrmqdev there is no backlog of messages from old versions, right? if you are saying it can happen a few days after an upgrade, I assume there are no messages older than that, correct? I just want to rule out anything upgrade related.
Side note: this queue is configured as lazy. That's not a thing since 3.12: https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#classic-queues-changes-to-the-lazy-mode
We are also investigating a similar/same case: two single-node clusters were upgraded from 3.12.14 -> 3.13.7. They have similar topology and traffic patterns: a lot of vhosts, each with one queue getting most of the traffic. The messages are ~20KB mixed with some ~200KB. The payloads are binary (so can contain any byte, including 255). The crashing queues are long: 4,000-10,000 messages. The persistent msg store of the vhost has many GBs of data (probably dominantly belonging to these single long queues). The msgs are published via a direct exchange to the queues (no fanout).
We couldn't reproduce the issue in a test environment so we don't know yet what attributes of the usage are relevant.
The queues crash when trying to fetch a message from the shared msg store, but the location does not match the content of the rdq file:
** Reason for termination ==
** {function_clause,
[{rabbit_msg_store,reader_pread_parse,
[[eof]],
[{file,"rabbit_msg_store.erl"},{line,692}]},
{rabbit_msg_store,reader_pread,2,
[{file,"rabbit_msg_store.erl"},{line,688}]},
{rabbit_msg_store,read_from_disk,2,
[{file,"rabbit_msg_store.erl"},{line,648}]},
{rabbit_msg_store,client_read3,2,
[{file,"rabbit_msg_store.erl"},{line,640}]},
{rabbit_variable_queue,with_msg_store_state,3,
[{file,"rabbit_variable_queue.erl"},{line,1348}]},
{rabbit_variable_queue,read_msg,5,
[{file,"rabbit_variable_queue.erl"},{line,1583}]},
{rabbit_variable_queue,fetch,2,
[{file,"rabbit_variable_queue.erl"},{line,616}]},
{rabbit_priority_queue,fetch,2,
[{file,"rabbit_priority_queue.erl"},{line,302}]}]}
or
** Reason for termination ==
** {{badmatch,<<51,3,137,141,241,159,1,151,204,12,68,12,119,111,64,110,26,150,
...
201,118,180,244,157,24,27,183,26,150,192,28,133,226,173,104>>},
[{rabbit_msg_store,reader_pread_parse,1,
[{file,"rabbit_msg_store.erl"},{line,696}]},
{rabbit_msg_store,reader_pread,2,
[{file,"rabbit_msg_store.erl"},{line,688}]},
{rabbit_msg_store,read_from_disk,2,
[{file,"rabbit_msg_store.erl"},{line,648}]},
{rabbit_msg_store,client_read3,2,
[{file,"rabbit_msg_store.erl"},{line,640}]},
{rabbit_variable_queue,with_msg_store_state,3,
[{file,"rabbit_variable_queue.erl"},{line,1348}]},
{rabbit_variable_queue,read_msg,5,
[{file,"rabbit_variable_queue.erl"},{line,1583}]},
{rabbit_variable_queue,fetch,2,
[{file,"rabbit_variable_queue.erl"},{line,616}]},
{rabbit_priority_queue,fetch,2,
[{file,"rabbit_priority_queue.erl"},{line,302}]}]}
I suspect that compaction somehow corrupts the rdq file or gets the rabbit_msg_store_ets_index out of sync.
Compaction is supposed to move still-referenced messages from the end of the file towards the beginning. However, I could not find the problematic msg id anywhere in the rdq file, so it was somehow lost from the file.
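To make the suspected failure mode concrete, here is a minimal Python sketch of compaction as described above. The record layout and function names are invented for illustration and do not reflect the actual rdq format: still-referenced messages are packed towards the start of the file and the file is then truncated, so if a live message is ever misclassified, it disappears from the file entirely.

```python
# Hypothetical model of compaction: pack still-referenced records towards the
# start of the file, then truncate. Names and layout are invented; this is
# NOT the actual rdq format.

def compact(records, live_ids):
    """records: list of (msg_id, payload) in file order.
    Returns the surviving records, their new offsets, and the new file size."""
    kept = [(mid, body) for mid, body in records if mid in live_ids]
    offsets, pos = {}, 0
    for mid, body in kept:
        offsets[mid] = pos          # the index must be updated to this offset
        pos += len(body)
    return kept, offsets, pos       # pos = size the file is truncated to

records = [("a", b"xxxx"), ("b", b"yy"), ("c", b"zzz")]
kept, offsets, size = compact(records, live_ids={"a", "c"})
print(kept)     # [('a', b'xxxx'), ('c', b'zzz')]
print(offsets)  # {'a': 0, 'c': 4}
# If a live message is wrongly left out of live_ids (the suspected bug),
# it is simply gone after the truncate.
```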
We could not rule out 100% that there is some data from before the upgrade, but some crashes happened more than a week after the upgrade.
@netrmqdev there is no backlog of messages from old versions, right? if you are saying it can happen a few days after an upgrade, I assume there are no messages older than that, correct? I just want to rule out anything upgrade related.
No, we flush everything before updating RMQ. The only exception was when I tested the 3.13.7 -> 4.0.1 migration: there were some older messages, as I did not want to lose data, but that's the only case.
Side note: this queue is configured as lazy. That's not a thing since 3.12: https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#classic-queues-changes-to-the-lazy-mode
Indeed, I noticed that while investigating and I plan to clean that very soon!
When the queue restarts it should tell you how many persistent messages were dropped. Better yet, killing the node and restarting it will log a few more interesting things as it will force a full node recovery and may drop messages both from the queue and from the message store, if they're truly gone. Would be great to get the logs for a node restarted like this. It would help confirm whether rabbit ate the messages or whether the message store index or queue are out of sync.
Has any of these environments run out of disk at some point?
Has any of these environments run out of disk at some point?
No they have a lot of free space
When the queue restarts it should tell you how many persistent messages were dropped. Better yet, killing the node and restarting it will log a few more interesting things as it will force a full node recovery and may drop messages both from the queue and from the message store, if they're truly gone. Would be great to get the logs for a node restarted like this. It would help confirm whether rabbit ate the messages or whether the message store index or queue are out of sync.
I believe the dropped messages log when the queue restarts only exists in 4.0.x, so I only have those 3 logs to share (all from the same server):
2024-09-23 10:30:56.020462+00:00 [error] <0.6365.0> Restarting crashed queue 'QUEUE_NAME_1' in vhost '/'.
(...)
2024-09-23 10:33:15.393172+00:00 [warning] <0.6365.0> Queue QUEUE_NAME_1 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-23 13:37:26.358731+00:00 [error] <0.88507.0> Restarting crashed queue 'QUEUE_NAME_2' in vhost '/'.
2024-09-23 13:37:28.046531+00:00 [warning] <0.88507.0> Queue QUEUE_NAME_2 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-24 14:53:39.553861+00:00 [error] <0.269219.0> Restarting crashed queue 'QUEUE_NAME_2' in vhost '/'.
2024-09-24 14:53:42.868098+00:00 [warning] <0.269219.0> Queue QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown
Thanks. No, the log message exists before 4.0 but only for CQv2; I suppose you were using CQv1.
The first two crashes don't drop any message.
QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown
This one tells us that the queue failed to find 48 of its messages in the shared message store (more precisely in its index).
Triggering a full restart by killing the node (I insist on killing the node, brutally shutting down the container or the node's process, as a normal restart won't do) will provide us with additional information in the logs to determine where those messages are.
There are two types of cases I investigated: in the first, the crash was actively happening; in the second, I was just inspecting a backup of the data dir (without any logs or in-memory info available).
In the first case I was able to trace the msg store index lookups with:
recon_trace:calls([{rabbit_msg_store, index_lookup_positive_ref_count, return_trace}], 10, [{pid, amqqueue:get_pid(element(2, rabbit_amqqueue:lookup(rabbit_misc:r(<<"VHOST">>,queue,<<"CRASHING_QUEUE">>))))}]).
And the MsgId was always found in the index with ref count = 1, eg:
rabbit_msg_store:index_lookup_positive_ref_count/2 --> {msg_location,<<42,0,200,124,102,232,152,34,183,160,227,152,159,25,95,231>>,
1,33915,6222253,378631}
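For readers following along, the traced result above appears to be a msg_location record: message id, ref count, file number, offset, and total size. A toy Python model of what a positive-ref-count lookup does (the field meanings here are my reading of the trace output, not the authoritative record definition):

```python
# Toy model of the msg store index: msg_id -> location with a ref count.
# Field meanings are inferred from the {msg_location, ...} trace above and
# are an assumption, not the authoritative record definition.

index = {
    "msg-1": {"ref_count": 1, "file": 33915, "offset": 6222253, "size": 378631},
    "msg-2": {"ref_count": 0, "file": 33915, "offset": 0, "size": 1024},
}

def index_lookup_positive_ref_count(index, msg_id):
    """Return the location only if the message is still referenced."""
    loc = index.get(msg_id)
    return loc if loc is not None and loc["ref_count"] > 0 else None

found = index_lookup_positive_ref_count(index, "msg-1")    # referenced: found
missing = index_lookup_positive_ref_count(index, "msg-2")  # ref count 0: None
print(found, missing)
```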
The crash log contains the queue state; from field q3 I can find the SeqId <-> MsgId mappings (and infer publish order), e.g.:
[{msg_status,77555,
<<42,0,200,124,102,232,152,34,183,160,227,152,159,25,95,231>>,
undefined,true,true,rabbit_msg_store,true,
msg_store,
{message_properties,undefined,false,377945}},
{msg_status,77556,
<<130,242,201,36,170,75,117,91,53,82,42,188,0,214,50,151>>,
undefined,true,undefined,rabbit_msg_store,true,
msg_store,
{message_properties,undefined,false,18329}},
{msg_status,77557,
<<198,29,81,216,243,153,73,243,243,79,123,100,248,82,90,220>>,
undefined,true,undefined,rabbit_msg_store,true,
msg_store,
{message_properties,undefined,false,18329}},
...
I processed 33915.rdq and associated its entries with queue sequence ids. The first message, 77555, is missing, but all the others are from the head (q3) of the crashed queue.
The badmatch crash has a huge binary in it (luckily). I managed to identify messages 77623-77614 in it. Because there are no zeros between them and they form a continuous reversed block, I suspect they were written there in one go during a compaction, potentially overwriting the missing message 77555.
In the second case I only had the queue index (CQv1) on disk and the rdq files.
In one example the first index segment file starts with (format: {RelSeqId, MsgId, Size}):
{137,del_or_ack},
{137,del_or_ack},
{137,<<29,116,45,4,62,102,142,6,42,153,250,128,55,253,101,164>>, 50457},
{138,del_or_ack},
{138,<<146,223,62,74,141,134,33,54,203,94,55,94,185,164,20,44>>, 204_137},
{139,del_or_ack},
{139,<<26,133,18,250,208,12,72,50,90,199,205,23,190,128,98,196>>, 95033},
...
The first message, 137, is acked, so it should have ref count = 0 in the msg store. The second message, 138, must be the one that causes the crash.
I speculate the last compaction may have been triggered by the ack of 137. This is supported by file timestamps as well. Both 137 and 139 are present in the rdq file (but 137 is not moved towards the beginning). There is an interesting 10MB zero-hole in the middle (right before 137) which is beyond the latest truncation line, so it must be the result of a previous compaction. I haven't untangled from the order of SeqIds which compaction might have moved which message where, and what could have been the original position of the missing message.
Theoretically it is possible that rabbit_msg_store:scan_file_for_valid_messages/1 misinterprets the content of an rdq file, and this way the compaction loses and overwrites a live message. This can happen if a zero-hole is followed by a message and there is a byte 255 in the appropriate place in the payload.
MsgId = binary:copy(<<"1">>, 16).
MC =
#mc{protocol = mc_amqpl,
data = #content{class_id = 60,properties = none,
properties_bin = <<48,0,0,0,0,0,2>>,
protocol = rabbit_framing_amqp_0_9_1,
payload_fragments_rev = [binary:copy(<<255>>, 26000)]},
annotations = #{id => MsgId,
p => 0,
x => <<>>,
rk => [<<"queue1">>],
rts => 1726812956209}}.
BinMsg = term_to_binary(MC).
BinMsgSize = byte_size(BinMsg).
Entry = <<(BinMsgSize + 16):(8*8), MsgId:16/binary, BinMsg/binary, 255>>.
%% This is how `scan_file_for_valid_messages` matches on messages
<<Size2:(8*8), MsgId2:16/binary, Msg2:Size2/binary, 255, _/binary>> = <<0, Entry/binary>>.
MsgId2 == MsgId.
The payload size of 26000 bytes is selected (binary for <<0,0,0,0,0,102,68>>) so with an extra leading zero it can be interpreted as a size of 102 bytes. The payload in the term_to_binary format of a #mc{} starts after 85 bytes (plus the 16-byte message id = 101).
This is just speculation and I don't know how likely it is to hit this coincidence.
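The coincidence can also be simulated outside of Erlang. This hedged Python sketch mimics only the framing implied by the match above (8-byte big-endian size, 16-byte message id, payload, terminating byte 255); it is not the real rdq parser, and the concrete sizes differ from the Erlang example. With a single zero-hole byte in front of a real entry whose payload is full of 255 bytes, a naive scan at offset 0 "finds" a bogus, much shorter message:

```python
import struct

TERMINATOR = 0xFF

def make_entry(msg_id: bytes, body: bytes) -> bytes:
    """8-byte big-endian size of (id + body), 16-byte id, body, byte 255."""
    assert len(msg_id) == 16
    return struct.pack(">Q", len(msg_id) + len(body)) + msg_id + body + bytes([TERMINATOR])

def naive_scan_at(buf: bytes, pos: int):
    """Try to parse one entry at pos, the way the scanner pattern-matches:
    read a size, then require byte 255 right after the claimed payload."""
    if pos + 8 > len(buf):
        return None
    size = struct.unpack_from(">Q", buf, pos)[0]
    end = pos + 8 + size
    if size >= 16 and end < len(buf) and buf[end] == TERMINATOR:
        return buf[pos + 8:pos + 24], size   # (claimed msg id, claimed size)
    return None

# A payload full of 255 bytes, with one zero-hole byte left in front of the
# real entry (e.g. residue of a previous compaction).
entry = make_entry(b"1" * 16, bytes([TERMINATOR]) * 26000)
buf = b"\x00" + entry

bogus = naive_scan_at(buf, 0)  # misreads a short size, "finds" a fake message
real = naive_scan_at(buf, 1)   # the actual entry starts one byte later
print(bogus[1], real[1])       # 101 26016
```

A scanner that accepts the bogus entry at offset 0 then skips past it, never seeing the real entry at offset 1.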
Thanks. No, the log message exists before 4.0 but only for CQv2; I suppose you were using CQv1.
Oh yes indeed, all our queues are in CQv1.
The first two crashes don't drop any message.
QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown
This one tells us that the queue failed to find 48 of its messages in the shared message store (more precisely in its index).
Triggering a full restart by killing the node (I insist on killing the node, brutally shutting down the container or the node's process, as a normal restart won't do) will provide us with additional information in the logs to determine where those messages are.
I tried that, but I'm not sure it brings a lot more information, although we can see the 2 affected queues dropping messages on restart:
2024-09-25 10:16:25.931677+00:00 [warning] <0.581.0> Queue AFFECTED_QUEUE_1 in vhost / dropped 2/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:17:41.034955+00:00 [warning] <0.548.0> Queue AFFECTED_QUEUE_2 in vhost / dropped 4/0/0 persistent messages and 0 transient messages after unclean shutdown
Thanks all.
I think the working theory is correct. The odds of this happening are very low, but over the many RabbitMQ installations this is bound to happen regularly. I will think about a way to efficiently make sure the message found is a real message; a naive approach would be to check that the remainder here can be decoded with binary_to_term, if e.g. @gomoripeti you want to try.
Sure, if you mean that I create a patch with the naive approach. (So far I couldn't trigger a compaction in a test env with naive approaches, so first I need to figure that out to verify the patch. Is there anything more to it than waiting 15 seconds after more than half of the messages in an rdq file (which is not the last one) are consumed?)
As the size of a msg cannot be larger than 512MB, which can be stored on 4 bytes, what if the first byte of the size were set to 255 (as a beginning-of-msg mark), or some other magic value were placed in the first few bytes? Would that help at all?
My concern is that calling binary_to_term is fine at startup during recovery, but it might create too much garbage or be too resource-intensive during normal operation. What if there were a "secondary index", another ets table keeping track of "rdq file" -> "msg id" mappings, so it can be cross-checked or serve as an input for the scanning? (There used to be a query to return all the msg ids for a given file, but it used the index table and was inefficient.)
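The secondary-index idea could look roughly like this (a Python sketch with invented names; RabbitMQ's real index lives in ETS): the primary msg_id -> location mapping is kept in lockstep with a file -> msg ids mapping, so that a scan of a single rdq file can be cross-checked cheaply without walking the whole primary index.

```python
from collections import defaultdict

class MsgStoreIndex:
    """Toy two-table index: primary msg_id -> (file, offset, size) plus a
    secondary file -> {msg_id} mapping kept in lockstep. Names are invented;
    this only illustrates the proposed structure."""

    def __init__(self):
        self.primary = {}
        self.by_file = defaultdict(set)

    def insert(self, msg_id, file, offset, size):
        self.primary[msg_id] = (file, offset, size)
        self.by_file[file].add(msg_id)

    def delete(self, msg_id):
        file, _, _ = self.primary.pop(msg_id)
        self.by_file[file].discard(msg_id)

    def ids_in_file(self, file):
        """What a scan of this one file would be checked against."""
        return set(self.by_file.get(file, ()))

idx = MsgStoreIndex()
idx.insert("m1", 33915, 0, 100)
idx.insert("m2", 33915, 100, 50)
idx.insert("m3", 33916, 0, 10)
idx.delete("m2")
print(idx.ids_in_file(33915))  # {'m1'}
```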
Careful, as the naive approach is expensive, so only do it if you have the right test files.
Yes, 15s since file access and half of the data gone.
It might be easier to trigger by sending messages that are basically just all bytes 255.
Ideally we could just make a better format, but it's a bit complex. This is best done later if/when we want to stop having the index fully in memory.
Right now my approach to fix this is that when we encounter a potential message we still check if there's a potential message in the next 8 bytes. Then when we cross reference with the index we should get the invalid messages dropped and only keep the valid ones, including the message we currently lose. The cross reference is strict (includes size and offset) so I don't expect any errors after that.
we still check if there's a potential message in the next 8 bytes.
Aha, so scanning could return overlapping messages, but cross-checking would filter out the misinterpreted ones.
Right.
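Putting the two comments above together, the idea can be sketched as: collect every plausible entry, overlapping ones included, then keep only the candidates whose (msg id, offset, size) triple is confirmed by the index. This brute-force Python illustration tries a parse at every byte offset, which is more exhaustive (and slower) than the actual patch; the framing is the same assumed layout as earlier (8-byte size, 16-byte id, payload, byte 255):

```python
import struct

def candidates(buf):
    """Collect every plausible (claimed_id, offset, size), overlapping ones
    included, by attempting a parse at every byte offset."""
    found = []
    for pos in range(len(buf) - 24):
        size = struct.unpack_from(">Q", buf, pos)[0]
        end = pos + 8 + size
        if size >= 16 and end < len(buf) and buf[end] == 0xFF:
            found.append((buf[pos + 8:pos + 24], pos, size))
    return found

def scan_with_index(buf, index):
    """Keep only candidates the index confirms: id, offset and size must all
    match, so misparsed 'messages' are dropped instead of eating real ones."""
    return [c for c in candidates(buf) if index.get(c[0]) == (c[1], c[2])]

# One real entry preceded by a zero-hole byte; the 255-filled payload lets a
# shifted parse at offset 0 produce a bogus overlapping candidate.
msg_id = b"1" * 16
body = bytes([0xFF]) * 26000
entry = struct.pack(">Q", 16 + len(body)) + msg_id + body + b"\xff"
buf = b"\x00" + entry

index = {msg_id: (1, 16 + len(body))}   # entry really starts at offset 1
cands = candidates(buf)
survivors = scan_with_index(buf, index)
print(len(cands), survivors)            # 2 candidates, only the real one kept
```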
Got a test case:
diff --git a/deps/rabbit/test/backing_queue_SUITE.erl b/deps/rabbit/test/backing_queue_SUITE.erl
index 10129201b9..8b718ec81f 100644
--- a/deps/rabbit/test/backing_queue_SUITE.erl
+++ b/deps/rabbit/test/backing_queue_SUITE.erl
@@ -513,8 +513,7 @@ msg_store_file_scan(Config) ->
?MODULE, msg_store_file_scan1, [Config]).
msg_store_file_scan1(Config) ->
- Scan = fun (Blocks) ->
- Expected = gen_result(Blocks),
+ ScanExp = fun (Blocks, Expected) ->
Path = gen_msg_file(Config, Blocks),
Result = rabbit_msg_store:scan_file_for_valid_messages(Path),
case Result of
@@ -522,6 +521,9 @@ msg_store_file_scan1(Config) ->
_ -> {expected, Expected, got, Result}
end
end,
+ Scan = fun (Blocks) ->
+ ScanExp(Blocks, gen_result(Blocks))
+ end,
%% Empty files.
ok = Scan([]),
ok = Scan([{pad, 1024}]),
@@ -629,6 +631,10 @@ msg_store_file_scan1(Config) ->
%% Messages with no content.
ok = Scan([{bin, <<0:64, "deadbeefdeadbeef", 255>>}]),
ok = Scan([{msg, gen_id(), <<>>}]),
+ %% Sneaky messages.
+ ok = ScanExp(
+ [{bin, <<0, 0:48, 17, 17, "idididididididid", 255, 0:5909/unit:8, 255>>}],
+ {ok, [{<<"idididididididid">>, 5920, 1}], 5921}),
%% All good!!
passed.
This fails with:
backing_queue_SUITE > backing_queue_tests > msg_store_file_scan
#1. {error,
{{exception,
{badmatch,
{expected,
{ok,[{<<"idididididididid">>,5920,1}],5921},
got,
{ok,[{<<17,105,100,105,100,105,100,105,100,105,100,
105,100,105,100,105>>,
26,0}],
26}}},
So the first attempt, with <<0:56,17>>, is accepted as the message, and it ends up eating the real message. Now to write the fix I mentioned.
Actually we should probably do the index_lookup immediately, while scanning. We are going to do it anyway in both locations where that function is used, so we may as well look it up now. The reason we want to do this is that even if we try the next 8 bytes, I don't know how much data I need to skip after doing that, while if we look up immediately while scanning, I know whether the message is real or not.
I think what needs to happen is that the funs from the two lists:foldl calls after scanning must instead be called while scanning, every time we find a potential message. These funs don't need to change much; they just need to tell us whether the message is to be kept or not.
How does this work in case of node recovery? Is the index already populated with msg ids from the queue indexes before reading the rdq files?
Yes.
I have opened https://github.com/rabbitmq/rabbitmq-server/pull/12392 with a patch. I have only tested via unit tests so this will require more extensive testing.
I had to disable file scanning when we delete files, for now. Checking against the index in that case is too costly, because when the message isn't in the index we try the next byte; we don't skip the message. And deleted files are not supposed to have anything in the index (or in the file). This was basically just an assert, so I am considering simply removing it.
Needs more work, so don't bother testing it. Just noticed a big error.
@netrmqdev we can provide you a one-off build to test #12392. What kind of package do you use/need? Would you be able to test such a one-off build at all (it's fine if you wouldn't, of course)
@netrmqdev we can provide you a one-off build to test #12392. What kind of package do you use/need? Would you be able to test such a one-off build at all (it's fine if you wouldn't, of course)
I can deploy a special build on some environments where the issue happened, yes. This will give a good idea of the patch fixing things or not.
We deploy a custom Docker image based off the rabbitmq management one (with some additional certificates, conf and definitions):
FROM rabbitmq:3.12.14-management
(...)
If you can provide a tagged rabbitmq management image (based on 4.0.xx I imagine), that should be fairly easy for me I believe.
@netrmqdev this (x86-64) tag should include #12392 (acb04e9): https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/sha-acb04e929f9b1030dd077c62991147876b08fff5/images/sha256-353eb18c3b4d98fd393eacaf98234a90bfc582f0c6d1f38232f8a4317b0cf4be?context=explore
It is technically 4.1.0-based but there are no meaningful changes compared to 4.0.2 (except for this PR in the image, that is).
Please give it a shot and let us know how it goes.
Thanks @michaelklishin , yes 4.1 based is ok. I will try to deploy the image on Monday and keep you updated on the results.
I deployed the image this morning on the same env from the previous logs in 4.0.x, and it did not go well. The RMQ server fully crashed soon after starting. The stack trace is:
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{rabbit_msg_store,scan_file_for_valid_messages,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store.erl"},{line,1487}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store,scan_and_vacuum_message_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store.erl"},{line,2108}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store,compact_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store.erl"},{line,1943}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store_gc,attempt_action,3,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store_gc.erl"},{line,110}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store_gc,handle_cast,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store_gc.erl"},{line,68}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1056}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}
Thanks! I have indeed missed something.
Please try the most up to date image which should be https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/loic-fix-cq-scan/images/sha256-ab439bb28985d6699f8b5a7c6ca91ca0d183202521f808b888c874b41146c02e?context=explore
Please try the most up to date image which should be https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/loic-fix-cq-scan/images/sha256-ab439bb28985d6699f8b5a7c6ca91ca0d183202521f808b888c874b41146c02e?context=explore
Thanks a lot, I deployed it this morning on the first env and so far so good 👍 If it's still good tomorrow, I will deploy it on more environments and let it run for a few days.
Hi all! We're observing the same issue. We went from 3.10.0 on CentOS 7 to 3.13.2 on Debian 12 last Thursday and since then have had those crashes every night in our canary environment, where we mirror live traffic of our application and test new releases before staging them to the live environment. Last night I tried out 4.0.2 and the crashes were still happening, although RabbitMQ managed to automatically restart affected queues, which didn't work with 3.13.2. I have now gone back to 3.12.14 and will again observe for a night or two.
Long story short: if someone can provide a DEB package with the attempted fix, I can do some real-world testing of it in our canary environment. Since we have no real users on there, it would be quite easy for me to do.
I'm sure @michaelklishin can help.
@kaistierl I will build a 64-bit Debian package (off of main plus this PR) later today.
@kaistierl would an ARM64 build do? Our x86-64 environment that I used to produce one-off builds needs to be re-created on different infrastructure for reasons outside of my control, and it will take some time.
But I can produce a local ARM64 Debian package.
@michaelklishin unfortunately not - I would need a x86-64 package.
...by the way: After downgrading to 3.12.14 everything is running stable again for me, so my observations match what was reported by others.
Here is a generic binary build of #12392 which is platform-independent https://github.com/rabbitmq/rabbitmq-server/actions/runs/11120630513/artifacts/1999351253.
I cannot promise any ETA for one-off x86-64 builds.
Note that both 3.12 and 3.13 are out of community support, so when a new patch release is out you will have to move to 4.0.x; a 3.13.x release will be available later and to paying customers only.
@netrmqdev any updates from your clusters?
@netrmqdev any updates from your clusters?
For reasons unrelated to RabbitMQ, I was unable to do as much testing as I wanted on my envs. However, I now have about 4 of them running the patch for ~12-24 hours each, with no issues so far. They will keep running over the weekend, so on Monday I can give more feedback 👍
@kaistierl as the patch only modifies one file, you can extract rabbit_msg_store.beam from the generic build Michael provided and copy it into a vanilla 4.0.2 install (under `
@gomoripeti RabbitMQ remains open source software under a specific and well documented support policy.
Except for serious enough security patches, there will be no more OSS RabbitMQ 3.13.x releases. Those who expect 3.13.x releases must pay and get them as part of Tanzu RabbitMQ.
Let's keep this discussion specific to this issue. Any attempts to convince our team to produce a 3.13.x release will be removed. This is something non-negotiable.
The patch will be backported to v3.13.x; that's all the core team can offer to non-paying (to VMware/Broadcom) 3.13.x users.
Describe the bug
After upgrading the RMQ server from version 3.12.14 to 3.13.6 or 3.13.7, we started randomly getting "Restarting crashed queue" errors for some of our queues. Those queues can have a throughput of 300 events/s, with big messages. We also tested version 4.0.1, but the same issue was spotted. Every time we roll back to 3.12.14, the issue no longer happens.
Our topology is quite simple: we have only one RabbitMQ server running on k8s with the Docker management image. Clients publishing and consuming the affected queues are written in .NET and use the official .NET RabbitMQ client.
Here are some stacktraces of when that happens:
3.13.7
4.0.1
Those issues seem very similar to what is described here: https://github.com/rabbitmq/rabbitmq-server/discussions/10902. However, that issue was supposedly fixed in 3.13.2.
Reproduction steps
Cannot reproduce in a test environment.
Expected behavior
No queue crashing
Additional context
No response