Open netrmqdev opened 1 week ago
@netrmqdev how large are those "big messages"?
The size varies a lot between a few KB to a few MB, but most messages are less than 1MB.
Another question is: how long should this "300 message/second, large messages" workload run before this issue can be hit.
Last time @lhoguin and @mkuratczyk had to investigate something similar it took a few weeks to find a reasonably reliable way to reproduce. So any details would be appreciated.
I wish I had identified a reproducible pattern, but:
Hello! The first log is probably missing a few lines at the beginning, the line just above what you copied is important.
I've found a similar occurrence where I have access to more logs than the first one (the ingestion of the logs into Grafana makes things a bit messy in this case):
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> trap_exit: true
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> status: running
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> heap_size: 66222786
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> stack_size: 28
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> reductions: 1274977716
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> neighbours:
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> supervisor: {<0.584.0>,rabbit_amqqueue_sup}
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> errorContext: child_terminated
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> reason: {function_clause,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{rabbit_msg_store,reader_pread_parse,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [[eof]],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,692}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_msg_store,reader_pread,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,688}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_msg_store,read_from_disk,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,648}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_msg_store,client_read3,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_msg_store.erl"},{line,640}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_variable_queue,with_msg_store_state,3,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_variable_queue.erl"},{line,1348}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_variable_queue,read_msg,5,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_variable_queue.erl"},{line,1583}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_variable_queue,fetch,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_variable_queue.erl"},{line,616}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_priority_queue,fetch,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{file,"rabbit_priority_queue.erl"},{line,302}]}]}
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> offender: [{pid,<0.585.0>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {id,rabbit_amqqueue},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {mfargs,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {rabbit_prequeue,start_link,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{amqqueue,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {resource,<<"/">>,queue,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> <<"QUEUE_NAME">>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> true,false,none,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> [{<<"x-queue-mode">>,longstr,<<"lazy">>}],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> <0.878.0>,[],[],[],undefined,undefined,[],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> undefined,live,0,[],<<"/">>,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> #{user => <<"rmq-internal">>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> rabbit_classic_queue,#{}},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> recovery,<0.583.0>]}},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {restart_type,transient},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {significant,true},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {shutdown,600000},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> {child_type,worker}]
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>
2024-09-18 22:52:26.317526+00:00 [error] <0.93868.0> Restarting crashed queue 'QUEUE_NAME' in vhost '/'
@netrmqdev there is no backlog of messages from old versions, right? if you are saying it can happen a few days after an upgrade, I assume there are no messages older than that, correct? I just want to rule out anything upgrade related.
Side note: this queue is configured as lazy. That's not a thing since 3.12: https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#classic-queues-changes-to-the-lazy-mode
We are also investigating a similar/same case: two single-node clusters were upgraded from 3.12.14 -> 3.13.7. They have similar topology and traffic patterns: a lot of vhosts, each with one queue getting most of the traffic. The messages are ~20KB mixed with some ~200KB. The payloads are binary (so can contain any byte, including 255). The crashing queues are long: 4,000-10,000 messages. The persistent msg store of the vhost has many GBs of data (probably dominantly belonging to these single long queues). The msgs are published via a direct exchange to the queues (no fanout).
We couldn't reproduce the issue in a test environment so we don't know yet what attributes of the usage are relevant.
The queues crash when trying to fetch a message from the shared msg store, but the location does not match the content of the rdq file:
** Reason for termination ==
** {function_clause,
[{rabbit_msg_store,reader_pread_parse,
[[eof]],
[{file,"rabbit_msg_store.erl"},{line,692}]},
{rabbit_msg_store,reader_pread,2,
[{file,"rabbit_msg_store.erl"},{line,688}]},
{rabbit_msg_store,read_from_disk,2,
[{file,"rabbit_msg_store.erl"},{line,648}]},
{rabbit_msg_store,client_read3,2,
[{file,"rabbit_msg_store.erl"},{line,640}]},
{rabbit_variable_queue,with_msg_store_state,3,
[{file,"rabbit_variable_queue.erl"},{line,1348}]},
{rabbit_variable_queue,read_msg,5,
[{file,"rabbit_variable_queue.erl"},{line,1583}]},
{rabbit_variable_queue,fetch,2,
[{file,"rabbit_variable_queue.erl"},{line,616}]},
{rabbit_priority_queue,fetch,2,
[{file,"rabbit_priority_queue.erl"},{line,302}]}]}
or
** Reason for termination ==
** {{badmatch,<<51,3,137,141,241,159,1,151,204,12,68,12,119,111,64,110,26,150,
...
201,118,180,244,157,24,27,183,26,150,192,28,133,226,173,104>>},
[{rabbit_msg_store,reader_pread_parse,1,
[{file,"rabbit_msg_store.erl"},{line,696}]},
{rabbit_msg_store,reader_pread,2,
[{file,"rabbit_msg_store.erl"},{line,688}]},
{rabbit_msg_store,read_from_disk,2,
[{file,"rabbit_msg_store.erl"},{line,648}]},
{rabbit_msg_store,client_read3,2,
[{file,"rabbit_msg_store.erl"},{line,640}]},
{rabbit_variable_queue,with_msg_store_state,3,
[{file,"rabbit_variable_queue.erl"},{line,1348}]},
{rabbit_variable_queue,read_msg,5,
[{file,"rabbit_variable_queue.erl"},{line,1583}]},
{rabbit_variable_queue,fetch,2,
[{file,"rabbit_variable_queue.erl"},{line,616}]},
{rabbit_priority_queue,fetch,2,
[{file,"rabbit_priority_queue.erl"},{line,302}]}]}
I suspect that compaction somehow corrupts the rdq file or gets the rabbit_msg_store_ets_index out of sync.
Compaction is supposed to move still-referenced messages from the end of the file towards the beginning. However, I could not find the problematic msg id anywhere in the rdq file, so it was somehow lost from the file.
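To make the suspected failure mode concrete, here is a minimal Python sketch of compaction as described above. The record layout and function names are invented for illustration and do not reflect the actual rdq format: still-referenced messages are packed towards the start of the file and the file is then truncated, so if a live message is ever misclassified, it disappears from the file entirely.

```python
# Hypothetical model of compaction: pack still-referenced records towards the
# start of the file, then truncate. Names and layout are invented; this is
# NOT the actual rdq format.

def compact(records, live_ids):
    """records: list of (msg_id, payload) in file order.
    Returns the surviving records, their new offsets, and the new file size."""
    kept = [(mid, body) for mid, body in records if mid in live_ids]
    offsets, pos = {}, 0
    for mid, body in kept:
        offsets[mid] = pos          # the index must be updated to this offset
        pos += len(body)
    return kept, offsets, pos       # pos = size the file is truncated to

records = [("a", b"xxxx"), ("b", b"yy"), ("c", b"zzz")]
kept, offsets, size = compact(records, live_ids={"a", "c"})
print(kept)     # [('a', b'xxxx'), ('c', b'zzz')]
print(offsets)  # {'a': 0, 'c': 4}
# If a live message is wrongly left out of live_ids (the suspected bug),
# it is simply gone after the truncate.
```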
We could not rule out 100% that there is some data from before the upgrade, but some crashes happened more than a week after the upgrade.
@netrmqdev there is no backlog of messages from old versions, right? if you are saying it can happen a few days after an upgrade, I assume there are no messages older than that, correct? I just want to rule out anything upgrade related.
No, we flush everything before updating RMQ. The only exception was when I tested the 3.13.7 -> 4.0.1 migration: there were some older messages, as I did not want to lose data, but that's the only case.
Side note: this queue is configured as lazy. That's not a thing since 3.12: https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#classic-queues-changes-to-the-lazy-mode
Indeed, I noticed that while investigating and I plan to clean that very soon!
When the queue restarts it should tell you how many persistent messages were dropped. Better yet, killing the node and restarting it will log a few more interesting things as it will force a full node recovery and may drop messages both from the queue and from the message store, if they're truly gone. Would be great to get the logs for a node restarted like this. It would help confirm whether rabbit ate the messages or whether the message store index or queue are out of sync.
Has any of these environments run out of disk at some point?
Has any of these environments run out of disk at some point?
No they have a lot of free space
When the queue restarts it should tell you how many persistent messages were dropped. Better yet, killing the node and restarting it will log a few more interesting things as it will force a full node recovery and may drop messages both from the queue and from the message store, if they're truly gone. Would be great to get the logs for a node restarted like this. It would help confirm whether rabbit ate the messages or whether the message store index or queue are out of sync.
I believe the dropped messages log when the queue restarts only exists in 4.0.x, so I only have those 3 logs to share (all from the same server):
2024-09-23 10:30:56.020462+00:00 [error] <0.6365.0> Restarting crashed queue 'QUEUE_NAME_1' in vhost '/'.
(...)
2024-09-23 10:33:15.393172+00:00 [warning] <0.6365.0> Queue QUEUE_NAME_1 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-23 13:37:26.358731+00:00 [error] <0.88507.0> Restarting crashed queue 'QUEUE_NAME_2' in vhost '/'.
2024-09-23 13:37:28.046531+00:00 [warning] <0.88507.0> Queue QUEUE_NAME_2 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-24 14:53:39.553861+00:00 [error] <0.269219.0> Restarting crashed queue 'QUEUE_NAME_2' in vhost '/'.
2024-09-24 14:53:42.868098+00:00 [warning] <0.269219.0> Queue QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown
Thanks. No, the log message exists before 4.0 but only for CQv2; I suppose you were using CQv1.
The first two crashes don't drop any message.
QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown
This one tells us that the queue failed to find 48 of its messages in the shared message store (more precisely in its index).
Triggering a full restart by killing the node (I insist on killing the node, brutally shutting down the container or the node's process, as a normal restart won't do) will provide us with additional information in the logs to determine where those messages are.
There are two types of cases I investigated: in the first, the crash was actively happening; in the second, I was just inspecting a backup of the data dir (without any logs or in-memory info available).
In the first case I was able to trace the msg store index lookups with:
recon_trace:calls([{rabbit_msg_store, index_lookup_positive_ref_count, return_trace}], 10, [{pid, amqqueue:get_pid(element(2, rabbit_amqqueue:lookup(rabbit_misc:r(<<"VHOST">>,queue,<<"CRASHING_QUEUE">>))))}]).
And the MsgId was always found in the index with ref count = 1, eg:
rabbit_msg_store:index_lookup_positive_ref_count/2 --> {msg_location,<<42,0,200,124,102,232,152,34,183,160,227,152,159,25,95,231>>,
1,33915,6222253,378631}
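For readers following along, the traced result above appears to be a msg_location record: message id, ref count, file number, offset, and total size. A toy Python model of what a positive-ref-count lookup does (the field meanings here are my reading of the trace output, not the authoritative record definition):

```python
# Toy model of the msg store index: msg_id -> location with a ref count.
# Field meanings are inferred from the {msg_location, ...} trace above and
# are an assumption, not the authoritative record definition.

index = {
    "msg-1": {"ref_count": 1, "file": 33915, "offset": 6222253, "size": 378631},
    "msg-2": {"ref_count": 0, "file": 33915, "offset": 0, "size": 1024},
}

def index_lookup_positive_ref_count(index, msg_id):
    """Return the location only if the message is still referenced."""
    loc = index.get(msg_id)
    return loc if loc is not None and loc["ref_count"] > 0 else None

found = index_lookup_positive_ref_count(index, "msg-1")    # referenced: found
missing = index_lookup_positive_ref_count(index, "msg-2")  # ref count 0: None
print(found, missing)
```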
The crash log contains the queue state; from field q3 I can find the SeqId <-> MsgId mappings (and infer publish order), e.g.:
[{msg_status,77555,
<<42,0,200,124,102,232,152,34,183,160,227,152,159,25,95,231>>,
undefined,true,true,rabbit_msg_store,true,
msg_store,
{message_properties,undefined,false,377945}},
{msg_status,77556,
<<130,242,201,36,170,75,117,91,53,82,42,188,0,214,50,151>>,
undefined,true,undefined,rabbit_msg_store,true,
msg_store,
{message_properties,undefined,false,18329}},
{msg_status,77557,
<<198,29,81,216,243,153,73,243,243,79,123,100,248,82,90,220>>,
undefined,true,undefined,rabbit_msg_store,true,
msg_store,
{message_properties,undefined,false,18329}},
...
I processed 33915.rdq and associated its entries with queue sequence ids. The first message, 77555, is missing, but all the others are from the head (q3) of the crashed queue.
The badmatch crash has a huge binary in it (luckily). I managed to identify messages 77623-77614 in it. Because there are no zeros between them and they form a continuous reversed block, I suspect they were written there in one go during a compaction, potentially overwriting the missing message 77555.
In the second case I only had the queue index (CQv1) on disk and the rdq files.
In one example the first index segment file starts with (format: {RelSeqId, MsgId, Size}):
{137,del_or_ack},
{137,del_or_ack},
{137,<<29,116,45,4,62,102,142,6,42,153,250,128,55,253,101,164>>, 50457},
{138,del_or_ack},
{138,<<146,223,62,74,141,134,33,54,203,94,55,94,185,164,20,44>>, 204_137},
{139,del_or_ack},
{139,<<26,133,18,250,208,12,72,50,90,199,205,23,190,128,98,196>>, 95033},
...
The first message, 137, is acked, so it should have ref count = 0 in the msg store. The second message, 138, must be the one that causes the crash.
I speculate the last compaction may have been triggered by the ack of 137. This is supported by file timestamps as well. Both 137 and 139 are present in the rdq file (but 137 is not moved towards the beginning). There is an interesting 10MB zero-hole in the middle (right before 137) which is beyond the latest truncation line, so it must be the result of a previous compaction. I haven't untangled from the order of SeqIds which compaction might have moved which message where, and what could have been the original position of the missing message.
Theoretically it is possible that rabbit_msg_store:scan_file_for_valid_messages/1 misinterprets the content of an rdq file, and this way the compaction loses and overwrites a live message. This can happen if a zero-hole is followed by a message and there is a byte 255 in the appropriate place in the payload.
MsgId = binary:copy(<<"1">>, 16).
MC =
#mc{protocol = mc_amqpl,
data = #content{class_id = 60,properties = none,
properties_bin = <<48,0,0,0,0,0,2>>,
protocol = rabbit_framing_amqp_0_9_1,
payload_fragments_rev = [binary:copy(<<255>>, 26000)]},
annotations = #{id => MsgId,
p => 0,
x => <<>>,
rk => [<<"queue1">>],
rts => 1726812956209}}.
BinMsg = term_to_binary(MC).
BinMsgSize = byte_size(BinMsg).
Entry = <<(BinMsgSize + 16):(8*8), MsgId:16/binary, BinMsg/binary, 255>>.
%% This is how `scan_file_for_valid_messages` matches on messages
<<Size2:(8*8), MsgId2:16/binary, Msg2:Size2/binary, 255, _/binary>> = <<0, Entry/binary>>.
MsgId2 == MsgId.
The payload size of 26000 bytes is selected (binary for <<0,0,0,0,0,102,68>>) so with an extra leading zero it can be interpreted as a size of 102 bytes. The payload in the term_to_binary format of a #mc{} starts after 85 bytes (plus the 16-byte message id = 101).
This is just speculation and I don't know how likely it is to hit this coincidence.
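The coincidence can also be simulated outside of Erlang. This hedged Python sketch mimics only the framing implied by the match above (8-byte big-endian size, 16-byte message id, payload, terminating byte 255); it is not the real rdq parser, and the concrete sizes differ from the Erlang example. With a single zero-hole byte in front of a real entry whose payload is full of 255 bytes, a naive scan at offset 0 "finds" a bogus, much shorter message:

```python
import struct

TERMINATOR = 0xFF

def make_entry(msg_id: bytes, body: bytes) -> bytes:
    """8-byte big-endian size of (id + body), 16-byte id, body, byte 255."""
    assert len(msg_id) == 16
    return struct.pack(">Q", len(msg_id) + len(body)) + msg_id + body + bytes([TERMINATOR])

def naive_scan_at(buf: bytes, pos: int):
    """Try to parse one entry at pos, the way the scanner pattern-matches:
    read a size, then require byte 255 right after the claimed payload."""
    if pos + 8 > len(buf):
        return None
    size = struct.unpack_from(">Q", buf, pos)[0]
    end = pos + 8 + size
    if size >= 16 and end < len(buf) and buf[end] == TERMINATOR:
        return buf[pos + 8:pos + 24], size   # (claimed msg id, claimed size)
    return None

# A payload full of 255 bytes, with one zero-hole byte left in front of the
# real entry (e.g. residue of a previous compaction).
entry = make_entry(b"1" * 16, bytes([TERMINATOR]) * 26000)
buf = b"\x00" + entry

bogus = naive_scan_at(buf, 0)  # misreads a short size, "finds" a fake message
real = naive_scan_at(buf, 1)   # the actual entry starts one byte later
print(bogus[1], real[1])       # 101 26016
```

A scanner that accepts the bogus entry at offset 0 then skips past it, never seeing the real entry at offset 1.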
Thanks. No, the log message exists before 4.0 but only for CQv2; I suppose you were using CQv1.
Oh yes indeed, all our queues are in CQv1.
The first two crashes don't drop any message.
QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown
This one tells us that the queue failed to find 48 of its messages in the shared message store (more precisely in its index).
Triggering a full restart by killing the node (I insist on killing the node, brutally shutting down the container or the node's process, as a normal restart won't do) will provide us with additional information in the logs to determine where those messages are.
I tried that, but I'm not sure it brings a lot more information, although we can see the 2 affected queues dropping messages on restart:
2024-09-25 10:16:25.931677+00:00 [warning] <0.581.0> Queue AFFECTED_QUEUE_1 in vhost / dropped 2/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:17:41.034955+00:00 [warning] <0.548.0> Queue AFFECTED_QUEUE_2 in vhost / dropped 4/0/0 persistent messages and 0 transient messages after unclean shutdown
Thanks all.
I think the working theory is correct. The odds of this happening are very low, but over the many RabbitMQ installations this is bound to happen regularly. I will think about a way to efficiently make sure the message found is a real message; a naive approach would be to check that the remainder here can be decoded with binary_to_term, if e.g. @gomoripeti you want to try.
Sure, if you mean that I create a patch with the naive approach. (So far I couldn't trigger a compaction in a test env with naive approaches, so first I need to figure that out to verify the patch. Is there anything more to it than waiting 15 seconds after more than half of the messages in an rdq file (which is not the last one) are consumed?)
As the size of a msg cannot be larger than 512MB, which can be stored on 4 bytes, what if the first byte of the size were set to 255 (as a beginning-of-msg mark), or some other magic value were placed in the first few bytes? Would that help at all?
My concern is that calling binary_to_term is fine at startup during recovery, but it might create too much garbage or be too resource-intensive during normal operation. What if there were a "secondary index", another ets table keeping track of "rdq file" -> "msg id" mappings, so it can be cross-checked or serve as an input for the scanning? (There used to be a query to return all the msg ids for a given file, but it used the index table and was inefficient.)
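The secondary-index idea could look roughly like this (a Python sketch with invented names; RabbitMQ's real index lives in ETS): the primary msg_id -> location mapping is kept in lockstep with a file -> msg ids mapping, so that a scan of a single rdq file can be cross-checked cheaply without walking the whole primary index.

```python
from collections import defaultdict

class MsgStoreIndex:
    """Toy two-table index: primary msg_id -> (file, offset, size) plus a
    secondary file -> {msg_id} mapping kept in lockstep. Names are invented;
    this only illustrates the proposed structure."""

    def __init__(self):
        self.primary = {}
        self.by_file = defaultdict(set)

    def insert(self, msg_id, file, offset, size):
        self.primary[msg_id] = (file, offset, size)
        self.by_file[file].add(msg_id)

    def delete(self, msg_id):
        file, _, _ = self.primary.pop(msg_id)
        self.by_file[file].discard(msg_id)

    def ids_in_file(self, file):
        """What a scan of this one file would be checked against."""
        return set(self.by_file.get(file, ()))

idx = MsgStoreIndex()
idx.insert("m1", 33915, 0, 100)
idx.insert("m2", 33915, 100, 50)
idx.insert("m3", 33916, 0, 10)
idx.delete("m2")
print(idx.ids_in_file(33915))  # {'m1'}
```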
Careful, as the naive approach is expensive, so only do it if you have the right test files.
Yes, 15s since file access and half of the data gone.
It might be easier to trigger by sending messages that are basically just all bytes 255.
Ideally we could just make a better format, but it's a bit complex. This is best done later if/when we want to stop having the index fully in memory.
Right now my approach to fix this is that when we encounter a potential message we still check if there's a potential message in the next 8 bytes. Then when we cross reference with the index we should get the invalid messages dropped and only keep the valid ones, including the message we currently lose. The cross reference is strict (includes size and offset) so I don't expect any errors after that.
we still check if there's a potential message in the next 8 bytes.
Aha, so scanning could return overlapping messages, but cross-checking would filter out the misinterpreted ones.
Right.
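Putting the two comments above together, the idea can be sketched as: collect every plausible entry, overlapping ones included, then keep only the candidates whose (msg id, offset, size) triple is confirmed by the index. This brute-force Python illustration tries a parse at every byte offset, which is more exhaustive (and slower) than the actual patch; the framing is the same assumed layout as earlier (8-byte size, 16-byte id, payload, byte 255):

```python
import struct

def candidates(buf):
    """Collect every plausible (claimed_id, offset, size), overlapping ones
    included, by attempting a parse at every byte offset."""
    found = []
    for pos in range(len(buf) - 24):
        size = struct.unpack_from(">Q", buf, pos)[0]
        end = pos + 8 + size
        if size >= 16 and end < len(buf) and buf[end] == 0xFF:
            found.append((buf[pos + 8:pos + 24], pos, size))
    return found

def scan_with_index(buf, index):
    """Keep only candidates the index confirms: id, offset and size must all
    match, so misparsed 'messages' are dropped instead of eating real ones."""
    return [c for c in candidates(buf) if index.get(c[0]) == (c[1], c[2])]

# One real entry preceded by a zero-hole byte; the 255-filled payload lets a
# shifted parse at offset 0 produce a bogus overlapping candidate.
msg_id = b"1" * 16
body = bytes([0xFF]) * 26000
entry = struct.pack(">Q", 16 + len(body)) + msg_id + body + b"\xff"
buf = b"\x00" + entry

index = {msg_id: (1, 16 + len(body))}   # entry really starts at offset 1
cands = candidates(buf)
survivors = scan_with_index(buf, index)
print(len(cands), survivors)            # 2 candidates, only the real one kept
```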
Got a test case:
diff --git a/deps/rabbit/test/backing_queue_SUITE.erl b/deps/rabbit/test/backing_queue_SUITE.erl
index 10129201b9..8b718ec81f 100644
--- a/deps/rabbit/test/backing_queue_SUITE.erl
+++ b/deps/rabbit/test/backing_queue_SUITE.erl
@@ -513,8 +513,7 @@ msg_store_file_scan(Config) ->
?MODULE, msg_store_file_scan1, [Config]).
msg_store_file_scan1(Config) ->
- Scan = fun (Blocks) ->
- Expected = gen_result(Blocks),
+ ScanExp = fun (Blocks, Expected) ->
Path = gen_msg_file(Config, Blocks),
Result = rabbit_msg_store:scan_file_for_valid_messages(Path),
case Result of
@@ -522,6 +521,9 @@ msg_store_file_scan1(Config) ->
_ -> {expected, Expected, got, Result}
end
end,
+ Scan = fun (Blocks) ->
+ ScanExp(Blocks, gen_result(Blocks))
+ end,
%% Empty files.
ok = Scan([]),
ok = Scan([{pad, 1024}]),
@@ -629,6 +631,10 @@ msg_store_file_scan1(Config) ->
%% Messages with no content.
ok = Scan([{bin, <<0:64, "deadbeefdeadbeef", 255>>}]),
ok = Scan([{msg, gen_id(), <<>>}]),
+ %% Sneaky messages.
+ ok = ScanExp(
+ [{bin, <<0, 0:48, 17, 17, "idididididididid", 255, 0:5909/unit:8, 255>>}],
+ {ok, [{<<"idididididididid">>, 5920, 1}], 5921}),
%% All good!!
passed.
This fails with:
backing_queue_SUITE > backing_queue_tests > msg_store_file_scan
#1. {error,
{{exception,
{badmatch,
{expected,
{ok,[{<<"idididididididid">>,5920,1}],5921},
got,
{ok,[{<<17,105,100,105,100,105,100,105,100,105,100,
105,100,105,100,105>>,
26,0}],
26}}},
So the first attempt, with <<0:56,17>>, is accepted as the message, and it ends up eating the real message. Now to write the fix I mentioned.
Actually we should probably do the index_lookup immediately, while scanning. We are going to do it anyway in both locations where that function is used, so we may as well look it up now. The reason we want to do this is that even if we try the next 8 bytes, I don't know how much data I need to skip after doing that, while if we look up immediately while scanning, I know whether the message is real or not.
I think what needs to happen is that the funs from the two lists:foldl calls after scanning must instead be called while scanning, every time we find a potential message. These funs don't need to change much; they just need to tell us whether the message is to be kept or not.
How does this work in case of node recovery? Is the index already populated with msg ids from the queue indexes before reading the rdq files?
Yes.
I have opened https://github.com/rabbitmq/rabbitmq-server/pull/12392 with a patch. I have only tested via unit tests so this will require more extensive testing.
I had to disable file scanning when we delete files, for now. Checking against the index in that case is too costly, because when the message isn't in the index we try the next byte; we don't skip the message. And deleted files are not supposed to have anything in the index (or in the file). This was basically just an assert, so I am considering simply removing it.
Needs more work, so don't bother testing it. Just noticed a big error.
@netrmqdev we can provide you a one-off build to test #12392. What kind of package do you use/need? Would you be able to test such a one-off build at all (it's fine if you wouldn't, of course)
@netrmqdev we can provide you a one-off build to test #12392. What kind of package do you use/need? Would you be able to test such a one-off build at all (it's fine if you wouldn't, of course)
I can deploy a special build on some environments where the issue happened, yes. This will give a good idea of the patch fixing things or not.
We deploy a custom Docker image based off the rabbitmq management one (with some additional certificates, conf and definitions):
FROM rabbitmq:3.12.14-management
(...)
If you can provide a tagged rabbitmq management image (based on 4.0.xx I imagine), that should be fairly easy for me I believe.
@netrmqdev this (x86-64) tag should include #12392 (acb04e9): https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/sha-acb04e929f9b1030dd077c62991147876b08fff5/images/sha256-353eb18c3b4d98fd393eacaf98234a90bfc582f0c6d1f38232f8a4317b0cf4be?context=explore
It is technically 4.1.0-based but there are no meaningful changes compared to 4.0.2 (except for this PR in the image, that is).
Please give it a shot and let us know how it goes.
Thanks @michaelklishin , yes 4.1 based is ok. I will try to deploy the image on Monday and keep you updated on the results.
I deployed the image this morning on the same env from the previous logs in 4.0.x, and it did not go well. The RMQ server fully crashed soon after starting. The stack trace is:
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{rabbit_msg_store,scan_file_for_valid_messages,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store.erl"},{line,1487}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store,scan_and_vacuum_message_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store.erl"},{line,2108}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store,compact_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store.erl"},{line,1943}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store_gc,attempt_action,3,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store_gc.erl"},{line,110}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {rabbit_msg_store_gc,handle_cast,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> [{file,"rabbit_msg_store_gc.erl"},{line,68}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1056}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}
Thanks! I have indeed missed something.
Please try the most up to date image which should be https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/loic-fix-cq-scan/images/sha256-ab439bb28985d6699f8b5a7c6ca91ca0d183202521f808b888c874b41146c02e?context=explore
Please try the most up to date image which should be https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/loic-fix-cq-scan/images/sha256-ab439bb28985d6699f8b5a7c6ca91ca0d183202521f808b888c874b41146c02e?context=explore
Thanks a lot, I deployed it this morning on the first env and so far so good 👍 If it's still good tomorrow, I will deploy it on more environments and let it run for a few days.
Hi all! We're observing the same issue. We went from 3.10.0 on CentOS 7 to 3.13.2 on Debian 12 last Thursday and since then have had those crashes every night in our canary environment, where we mirror live traffic of our application and test new releases before staging them to the live environment. Last night I tried out 4.0.2 and the crashes were still happening, although RabbitMQ managed to automatically restart affected queues, which didn't work with 3.13.2. I have now gone back to 3.12.14 and will again observe for a night or two.
Long story short: if someone can provide a DEB package with the attempted fix, I can do some real-world testing of it in our canary environment. Since we have no real users on there, it would be quite easy for me to do.
I'm sure @michaelklishin can help.
@kaistierl I will build a 64-bit Debian package (off of main plus this PR) later today.
@kaistierl would an ARM64 build do? Our x86-64 environment that I used to produce one-off builds needs to be re-created on different infrastructure for reasons outside of my control, and it will take some time.
But I can produce a local ARM64 Debian package.
@michaelklishin unfortunately not - I would need a x86-64 package.
...by the way: After downgrading to 3.12.14 everything is running stable again for me, so my observations match what was reported by others.
Here is a generic binary build of #12392 which is platform-independent https://github.com/rabbitmq/rabbitmq-server/actions/runs/11120630513/artifacts/1999351253.
I cannot promise any ETA for one-off x86-64 builds.
Note that both 3.12 and 3.13 are out of community support, so when a new patch release is out you will have to move to 4.0.x; a 3.13.x release will be available later and to paying customers only.
@netrmqdev any updates from your clusters?
@netrmqdev any updates from your clusters?
For reasons unrelated to RabbitMQ, I was unable to do as much testing as I wanted on my envs. However, I now have about 4 of them running the patch for ~12-24 hours each, with no issues so far. They will keep running over the weekend, so on Monday I can give more feedback 👍
@kaistierl as the patch only modifies one file, you can extract rabbit_msg_store.beam from the generic build Michael provided and copy it into a vanilla 4.0.2 install (under `
@gomoripeti RabbitMQ remains open source software under a specific and well documented support policy.
Except for serious enough security patches, there will be no more OSS RabbitMQ 3.13.x releases. Those who expect 3.13.x releases must pay and get them as part of Tanzu RabbitMQ.
Let's keep this discussion specific to this issue. Any attempts to convince our team to produce a 3.13.x release will be removed. This is something non-negotiable.
The patch will be backported to v3.13.x; that's all the core team can offer to non-paying (to VMware/Broadcom) 3.13.x users.
Describe the bug
After upgrading the RMQ server from version 3.12.14 to 3.13.6 or 3.13.7, we started randomly getting "Restarting crashed queue" errors for some of our queues. Those queues can have a throughput of 300 events/s, with big messages. We also tested version 4.0.1, but the same issue was spotted. Every time we roll back to 3.12.14, the issue no longer happens.
Our topology is quite simple: we have only one RabbitMQ server running on k8s with the Docker management image. Clients publishing and consuming the affected queues are written in .NET and use the official .NET RabbitMQ client.
Here are some stacktraces of when that happens:
3.13.7
4.0.1
Those issues seem very similar to what is described here: https://github.com/rabbitmq/rabbitmq-server/discussions/10902. However, that issue was supposedly fixed in 3.13.2.
Reproduction steps
Cannot reproduce in a test environment.
Expected behavior
No queue crashing
Additional context
No response