Closed by riyad 9 years ago
Thanks, I think I've found the reason, here:
https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_dead_letter.erl#L148 https://github.com/rabbitmq/rabbitmq-server/blob/master/src/rabbit_dead_letter.erl#L156
We assume that the <<"count">> field is a long, hence the case clause failure.
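The failure mode can be modeled outside Erlang. AMQP 0-9-1 field tables tag every value with a type, so a count of 1 tagged signedint is a different term from the same 1 tagged long. A hypothetical Python sketch of the strict lookup (an illustration only, not the actual RabbitMQ code):

```python
def increment_xdeath_count(info):
    """Hypothetical model of the strict lookup: 'count' is accepted
    only when its type tag is 'long'; any other tag is unmatched,
    analogous to the Erlang case_clause crash."""
    entry = info.get("count")
    if entry is None:
        # no count header yet: the 'undefined' branch, start at 1
        info["count"] = ("long", 1)
    elif entry[0] == "long":
        info["count"] = ("long", entry[1] + 1)
    else:
        # a client re-published the header with a different type tag
        raise ValueError("case_clause: %r" % (entry,))

info = {"count": ("long", 1)}
increment_xdeath_count(info)
print(info["count"])  # ('long', 2)

try:
    increment_xdeath_count({"count": ("signedint", 1)})
except ValueError as exc:
    print(exc)        # case_clause: ('signedint', 1)
```

The crashing term in the reports below, `{value,{<<"count">>,signedint,1}}`, has exactly this shape: the value is fine, only the type tag differs.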
Can you please provide a script that reproduces the issue?
I can confirm this issue, which is also happening for us after upgrading from RabbitMQ 3.5.0 to 3.5.3 on Ubuntu Server 14.04 LTS, running in AWS EC2. Initially, I was running the distribution's Erlang R16B03, but now I'm running Erlang R18. Interaction with RabbitMQ is via Ruby, through the Bunny 1.7.0 library (with amq-protocol 1.9.2).
Our logic is similar to that described by @riyad, where messages are published through a dead-letter exchange potentially multiple times. I reinstalled our broker yesterday, and all was running fine until this afternoon, when errors were reported in the RabbitMQ logs, after which CPU usage immediately started climbing fairly steadily, until the Erlang beam.smp process was almost maxing out the CPUs. Message throughput is currently very low, around 1 message per 5 minutes. The number of messages across all queues is low, under 20 total. We're using durable queues and persistent messages. Memory usage is low, far below the watermark.
Peace, tiredpixel
Status of node 'rabbit@i-x.example.com' ...
[{pid,596},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","3.5.3"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.5.3"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.5.3"},
{rabbit,"RabbitMQ","3.5.3"},
{webmachine,"webmachine","1.10.3-rmq3.5.3-gite9359c7"},
{mochiweb,"MochiMedia Web Server","2.7.0-rmq3.5.3-git680dba8"},
{amqp_client,"RabbitMQ AMQP Client","3.5.3"},
{xmerl,"XML parser","1.3.8"},
{mnesia,"MNESIA CXC 138 12","4.13"},
{inets,"INETS CXC 138 49","6.0"},
{os_mon,"CPO CXC 138 46","2.4"},
{sasl,"SASL CXC 138 11","2.5"},
{stdlib,"ERTS CXC 138 10","2.5"},
{kernel,"ERTS CXC 138 10","4.0"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang/OTP 18 [erts-7.0] [source] [64-bit] [smp:2:2] [async-threads:30] [kernel-poll:true]\n"},
{memory,
[{total,383717416},
{connection_readers,1325456},
{connection_writers,7694952},
{connection_channels,125132600},
{connection_other,85150160},
{queue_procs,19523888},
{queue_slave_procs,0},
{plugins,20670368},
{other_proc,7088664},
{mnesia,1589408},
{mgmt_db,34545264},
{msg_index,450864},
{other_ets,3566880},
{binary,44084160},
{code,20075250},
{atom,785313},
{other_system,12034189}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,6416846028},
{disk_free_limit,50000000},
{disk_free,1036529143808},
{file_descriptors,
[{total_limit,16284},
{total_used,133},
{sockets_limit,14653},
{sockets_used,99}]},
{processes,[{limit,1048576},{used,26879}]},
{run_queue,1193},
{uptime,87068}]
Cluster status of node 'rabbit@i-x.example.com' ...
[{nodes,[{disc,['rabbit@i-x.example.com']}]},
{running_nodes,['rabbit@i-x.example.com']},
{cluster_name,<<"rabbit@i-x.example.com">>},
{partitions,[]}]
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
** Generic server <0.755.0> terminating
** Last message in was {drop_expired,1}
** When Server state == {q,
{amqqueue,
{resource,<<"x_messenger">>,queue,
<<"ha.x_messenger.app_x_queue_error">>},
true,false,none,
[{<<"x-dead-letter-exchange">>,longstr,
<<"ha.x_messenger.app_x_queue">>},
{<<"x-message-ttl">>,signedint,300000}],
<0.755.0>,[],[],[],undefined,[],[],live},
none,false,rabbit_priority_queue,
{passthrough,rabbit_variable_queue,
{vqstate,
{0,{[],[]}},
{0,{[],[]}},
{delta,undefined,0,undefined},
{0,{[],[]}},
{2,
{[{msg_status,16,
<<145,236,191,185,197,127,102,104,95,63,130,119,
35,220,135,98>>,
{basic_message,
{resource,<<"x_messenger">>,exchange,
<<"ha.x_messenger.app_x_queue_error">>},
[<<"ha.x_messenger.app_x_queue">>],
{content,60,
{'P_basic',<<"application/octet-stream">>,
undefined,
[{<<"x-death">>,array,
[{table,
[{<<"count">>,signedint,1},
{<<"reason">>,longstr,<<"expired">>},
{<<"queue">>,longstr,
<<"ha.x_messenger.app_x_queue_error">>},
{<<"time">>,timestamp,1436273833},
{<<"exchange">>,longstr,
<<"ha.x_messenger.app_x_queue_error">>},
{<<"routing-keys">>,array,
[{longstr,
<<"ha.x_messenger.app_x_queue">>}]}]}]}],
2,0,undefined,undefined,undefined,
<<"79d42774-1346-4bc9-a08d-3ba45c61a79a">>,
undefined,undefined,undefined,undefined,
undefined},
<<184,128,24,97,112,112,108,105,99,97,116,105,
111,110,47,111,99,116,101,116,45,115,116,
114,101,97,109,0,0,1,25,7,120,45,100,101,97,
116,104,65,0,0,1,12,70,0,0,1,7,5,99,111,117,
110,116,73,0,0,0,1,6,114,101,97,115,111,110,
83,0,0,0,7,101,120,112,105,114,101,100,5,
113,117,101,117,101,83,0,0,0,59,104,97,46,
100,99,95,109,101,115,115,101,110,103,101,
114,95,112,114,111,100,46,118,49,46,116,114,
97,110,115,102,111,114,109,101,114,95,100,
99,95,97,114,99,104,105,118,101,115,95,112,
97,114,116,115,95,101,114,114,111,114,4,116,
105,109,101,84,0,0,0,0,85,155,204,169,8,101,
120,99,104,97,110,103,101,83,0,0,0,59,104,
97,46,100,99,95,109,101,115,115,101,110,103,
101,114,95,112,114,111,100,46,118,49,46,116,
114,97,110,115,102,111,114,109,101,114,95,
100,99,95,97,114,99,104,105,118,101,115,95,
112,97,114,116,115,95,101,114,114,111,114,
12,114,111,117,116,105,110,103,45,107,101,
121,115,65,0,0,0,58,83,0,0,0,53,104,97,46,
100,99,95,109,101,115,115,101,110,103,101,
114,95,112,114,111,100,46,118,49,46,116,114,
97,110,115,102,111,114,109,101,114,95,100,
99,95,97,114,99,104,105,118,101,115,95,112,
97,114,116,115,2,0,36,55,57,100,52,50,55,55,
52,45,49,51,52,54,45,52,98,99,57,45,97,48,
56,100,45,51,98,97,52,53,99,54,49,97,55,57,
97>>,
rabbit_framing_amqp_0_9_1,
[<<133,162,116,114,171,99,114,97,119,108,95,
115,116,101,112,115,166,116,114,95,115,
101,116,131,166,112,114,101,102,105,120,
169,99,114,97,119,108,46,51,54,54,174,
116,114,97,110,115,102,111,114,109,101,
100,95,97,116,183,50,48,49,53,45,48,55,
45,48,55,32,48,50,58,53,56,58,51,48,32,
85,84,67,172,116,114,97,110,115,102,111,
114,109,95,105,100,210,138,95,31,211,166,
105,110,112,117,116,115,129,168,99,114,
97,119,108,95,105,100,205,1,110,164,112,
97,114,116,132,164,105,100,95,102,184,53,
53,57,97,98,48,56,57,54,57,50,100,51,53,
51,51,97,49,101,56,48,52,48,48,164,105,
100,95,116,184,53,53,57,97,98,49,101,51,
54,57,50,100,51,53,51,51,99,100,56,52,48,
54,48,48,164,105,100,95,111,205,35,40,
164,105,100,95,105,10,172,99,97,108,108,
98,97,99,107,95,107,101,121,131,168,112,
105,101,99,101,95,105,100,205,15,183,177,
116,114,97,110,115,102,111,114,109,97,
116,105,111,110,95,105,100,205,2,12,181,
116,114,97,110,115,102,111,114,109,97,
116,105,111,110,95,115,101,116,95,105,
100,205,1,6>>]},
<<145,236,191,185,197,127,102,104,95,63,130,
119,35,220,135,98>>,
true},
true,false,false,true,queue_index,
{message_properties,1436279640057485,true,
270}}],
[{msg_status,15,
<<168,84,203,115,201,232,154,241,51,33,92,99,
248,170,24,198>>,
{basic_message,
{resource,<<"x_messenger">>,exchange,
<<"ha.x_messenger.app_x_queue_error">>},
[<<"ha.x_messenger.app_x_queue">>],
{content,60,
{'P_basic',<<"application/octet-stream">>,
undefined,
[{<<"x-death">>,array,
[{table,
[{<<"count">>,signedint,1},
{<<"reason">>,longstr,<<"expired">>},
{<<"queue">>,longstr,
<<"ha.x_messenger.app_x_queue_error">>},
{<<"time">>,timestamp,1436238931},
{<<"exchange">>,longstr,
<<"ha.x_messenger.app_x_queue_error">>},
{<<"routing-keys">>,array,
[{longstr,
<<"ha.x_messenger.app_x_queue">>}]}]}]}],
2,0,undefined,undefined,undefined,
<<"788e03ca-672a-4c46-8133-6a2fc2c4f4f4">>,
undefined,undefined,undefined,undefined,
undefined},
<<184,128,24,97,112,112,108,105,99,97,116,105,
111,110,47,111,99,116,101,116,45,115,116,
114,101,97,109,0,0,1,25,7,120,45,100,101,97,
116,104,65,0,0,1,12,70,0,0,1,7,5,99,111,117,
110,116,73,0,0,0,1,6,114,101,97,115,111,110,
83,0,0,0,7,101,120,112,105,114,101,100,5,
113,117,101,117,101,83,0,0,0,59,104,97,46,
100,99,95,109,101,115,115,101,110,103,101,
114,95,112,114,111,100,46,118,49,46,116,114,
97,110,115,102,111,114,109,101,114,95,100,
99,95,97,114,99,104,105,118,101,115,95,112,
97,114,116,115,95,101,114,114,111,114,4,116,
105,109,101,84,0,0,0,0,85,155,68,83,8,101,
120,99,104,97,110,103,101,83,0,0,0,59,104,
97,46,100,99,95,109,101,115,115,101,110,103,
101,114,95,112,114,111,100,46,118,49,46,116,
114,97,110,115,102,111,114,109,101,114,95,
100,99,95,97,114,99,104,105,118,101,115,95,
112,97,114,116,115,95,101,114,114,111,114,
12,114,111,117,116,105,110,103,45,107,101,
121,115,65,0,0,0,58,83,0,0,0,53,104,97,46,
100,99,95,109,101,115,115,101,110,103,101,
114,95,112,114,111,100,46,118,49,46,116,114,
97,110,115,102,111,114,109,101,114,95,100,
99,95,97,114,99,104,105,118,101,115,95,112,
97,114,116,115,2,0,36,55,56,56,101,48,51,99,
97,45,54,55,50,97,45,52,99,52,54,45,56,49,
51,51,45,54,97,50,102,99,50,99,52,102,52,
102,52>>,
rabbit_framing_amqp_0_9_1,
[<<133,162,116,114,171,99,114,97,119,108,95,
115,116,101,112,115,166,116,114,95,115,
101,116,131,166,112,114,101,102,105,120,
169,99,114,97,119,108,46,51,54,54,174,
116,114,97,110,115,102,111,114,109,101,
100,95,97,116,183,50,48,49,53,45,48,55,
45,48,55,32,48,50,58,53,56,58,51,48,32,
85,84,67,172,116,114,97,110,115,102,111,
114,109,95,105,100,210,138,95,31,211,166,
105,110,112,117,116,115,129,168,99,114,
97,119,108,95,105,100,205,1,110,164,112,
97,114,116,132,164,105,100,95,102,184,53,
53,57,97,98,53,102,53,54,57,50,100,51,53,
51,53,48,49,54,53,48,55,48,48,164,105,
100,95,116,184,53,53,57,97,98,55,51,57,
54,57,50,100,51,57,54,50,49,98,55,48,48,
54,48,48,164,105,100,95,111,205,50,200,
164,105,100,95,105,14,172,99,97,108,108,
98,97,99,107,95,107,101,121,131,168,112,
105,101,99,101,95,105,100,205,15,187,177,
116,114,97,110,115,102,111,114,109,97,
116,105,111,110,95,105,100,205,2,12,181,
116,114,97,110,115,102,111,114,109,97,
116,105,111,110,95,115,101,116,95,105,
100,205,1,6>>]},
<<168,84,203,115,201,232,154,241,51,33,92,99,
248,170,24,198>>,
true},
true,false,false,true,queue_index,
{message_properties,1436279449181213,true,
270}}]}},
17,
{0,nil},
{0,nil},
{0,nil},
{qistate,
"/var/lib/rabbitmq/mnesia/rabbit@i-x.example.com/queues/69RXCRA0YQFLNESM4YE1XHF0G",
{{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
[{segment,0,
"/var/lib/rabbitmq/mnesia/rabbit@i-x.example.com/queues/69RXCRA0YQFLNESM4YE1XHF0G/0.idx",
{array,16384,0,undefined,100000},
2}]},
#Ref<0.0.1572866.70861>,0,65536,
#Fun<rabbit_variable_queue.2.117761292>,
#Fun<rabbit_variable_queue.3.48316793>,
{0,nil},
{0,nil}},
{{client_msstate,msg_store_persistent,
<<13,24,107,90,238,140,22,42,193,193,247,117,63,
25,47,49>>,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
{state,225357,
"/var/lib/rabbitmq/mnesia/rabbit@i-x.example.com/msg_store_persistent"},
rabbit_msg_store_ets_index,
"/var/lib/rabbitmq/mnesia/rabbit@i-x.example.com/msg_store_persistent",
<0.240.0>,229454,221260,233551,237648},
{client_msstate,msg_store_transient,
<<101,74,79,173,74,7,174,242,180,223,118,182,139,
198,89,135>>,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
{state,204872,
"/var/lib/rabbitmq/mnesia/rabbit@i-x.example.com/msg_store_transient"},
rabbit_msg_store_ets_index,
"/var/lib/rabbitmq/mnesia/rabbit@i-x.example.com/msg_store_transient",
<0.233.0>,208969,200775,213066,217163}},
true,0,2,540,0,2,540,infinity,2,0,0,540,0,0,
{rates,0.005228732197132713,
1.7370371540135685e-303,1.7370371540135685e-303,
1.7370371540135685e-303,
{1436,279341,778502}},
{0,nil},
{0,nil},
{0,nil},
{0,nil},
0,0,0,17}},
{state,
{queue,[],[],0},
{inactive,1436238631386029,41942114405,1.0}},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
300000,
{erlang,#Ref<0.0.3932161.80107>},
1436279449181213,
{state,
{dict,2,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],
[[<0.3455.6>|#Ref<0.0.3932161.80106>]],
[],[],[],[],[],[],[],
[[<0.3479.6>|#Ref<0.0.3932161.86474>]],
[],[],[]}}},
delegate},
{resource,<<"x_messenger">>,exchange,
<<"ha.x_messenger.app_x_queue">>},
undefined,undefined,undefined,1,running}
** Reason for termination ==
** {{case_clause,{value,{<<"count">>,signedint,1}}},
[{rabbit_dead_letter,x_death_event_key,3,[]},
{rabbit_dead_letter,increment_xdeath_event_count,1,[]},
{rabbit_dead_letter,update_x_death_header,2,[]},
{rabbit_dead_letter,'-make_msg/5-fun-2-',8,[]},
{rabbit_basic,map_headers,2,[]},
{rabbit_dead_letter,make_msg,5,[]},
{rabbit_dead_letter,publish,5,[]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',7,[]}]}
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
Restarting crashed queue 'ha.x_messenger.app_x_queue_error' in vhost 'x_messenger'.
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
** Generic server <0.7229.6> terminating
** Last message in was {'$gen_cast',init}
** When Server state == {q,{amqqueue,
{resource,<<"x_messenger">>,queue,
<<"ha.x_messenger.app_x_queue_error">>},
true,false,none,
[{<<"x-dead-letter-exchange">>,longstr,
<<"ha.x_messenger.app_x_queue">>},
{<<"x-message-ttl">>,signedint,300000}],
<0.7229.6>,[],[],[],undefined,[],[],crashed},
none,false,undefined,undefined,
{state,
{queue,[],[],0},
{active,1436279449185300,1.0}},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]}}},
delegate},
undefined,undefined,undefined,undefined,0,running}
** Reason for termination ==
** {{case_clause,{value,{<<"count">>,signedint,1}}},
[{rabbit_dead_letter,x_death_event_key,3,[]},
{rabbit_dead_letter,increment_xdeath_event_count,1,[]},
{rabbit_dead_letter,update_x_death_header,2,[]},
{rabbit_dead_letter,'-make_msg/5-fun-2-',8,[]},
{rabbit_basic,map_headers,2,[]},
{rabbit_dead_letter,make_msg,5,[]},
{rabbit_dead_letter,publish,5,[]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',7,[]}]}
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
Restarting crashed queue 'ha.x_messenger.app_x_queue_error' in vhost 'x_messenger'.
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
** Generic server <0.7231.6> terminating
** Last message in was {'$gen_cast',init}
** When Server state == {q,{amqqueue,
{resource,<<"x_messenger">>,queue,
<<"ha.x_messenger.app_x_queue_error">>},
true,false,none,
[{<<"x-dead-letter-exchange">>,longstr,
<<"ha.x_messenger.app_x_queue">>},
{<<"x-message-ttl">>,signedint,300000}],
<0.7231.6>,[],[],[],undefined,[],[],live},
none,false,undefined,undefined,
{state,
{queue,[],[],0},
{active,1436279449191151,1.0}},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]}}},
delegate},
undefined,undefined,undefined,undefined,0,running}
** Reason for termination ==
** {{case_clause,{value,{<<"count">>,signedint,1}}},
[{rabbit_dead_letter,x_death_event_key,3,[]},
{rabbit_dead_letter,increment_xdeath_event_count,1,[]},
{rabbit_dead_letter,update_x_death_header,2,[]},
{rabbit_dead_letter,'-make_msg/5-fun-2-',8,[]},
{rabbit_basic,map_headers,2,[]},
{rabbit_dead_letter,make_msg,5,[]},
{rabbit_dead_letter,publish,5,[]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',7,[]}]}
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
Restarting crashed queue 'ha.x_messenger.app_x_queue_error' in vhost 'x_messenger'.
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
** Generic server <0.7233.6> terminating
** Last message in was {'$gen_cast',init}
** When Server state == {q,{amqqueue,
{resource,<<"x_messenger">>,queue,
<<"ha.x_messenger.app_x_queue_error">>},
true,false,none,
[{<<"x-dead-letter-exchange">>,longstr,
<<"ha.x_messenger.app_x_queue">>},
{<<"x-message-ttl">>,signedint,300000}],
<0.7233.6>,[],[],[],undefined,[],[],live},
none,false,undefined,undefined,
{state,
{queue,[],[],0},
{active,1436279449196370,1.0}},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]}}},
delegate},
undefined,undefined,undefined,undefined,0,running}
** Reason for termination ==
** {{case_clause,{value,{<<"count">>,signedint,1}}},
[{rabbit_dead_letter,x_death_event_key,3,[]},
{rabbit_dead_letter,increment_xdeath_event_count,1,[]},
{rabbit_dead_letter,update_x_death_header,2,[]},
{rabbit_dead_letter,'-make_msg/5-fun-2-',8,[]},
{rabbit_basic,map_headers,2,[]},
{rabbit_dead_letter,make_msg,5,[]},
{rabbit_dead_letter,publish,5,[]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',7,[]}]}
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
Restarting crashed queue 'ha.x_messenger.app_x_queue_error' in vhost 'x_messenger'.
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
** Generic server <0.7235.6> terminating
** Last message in was {'$gen_cast',init}
** When Server state == {q,{amqqueue,
{resource,<<"x_messenger">>,queue,
<<"ha.x_messenger.app_x_queue_error">>},
true,false,none,
[{<<"x-dead-letter-exchange">>,longstr,
<<"ha.x_messenger.app_x_queue">>},
{<<"x-message-ttl">>,signedint,300000}],
<0.7235.6>,[],[],[],undefined,[],[],live},
none,false,undefined,undefined,
{state,
{queue,[],[],0},
{active,1436279449201433,1.0}},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]}}},
delegate},
undefined,undefined,undefined,undefined,0,running}
** Reason for termination ==
** {{case_clause,{value,{<<"count">>,signedint,1}}},
[{rabbit_dead_letter,x_death_event_key,3,[]},
{rabbit_dead_letter,increment_xdeath_event_count,1,[]},
{rabbit_dead_letter,update_x_death_header,2,[]},
{rabbit_dead_letter,'-make_msg/5-fun-2-',8,[]},
{rabbit_basic,map_headers,2,[]},
{rabbit_dead_letter,make_msg,5,[]},
{rabbit_dead_letter,publish,5,[]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',7,[]}]}
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
Restarting crashed queue 'ha.x_messenger.app_x_queue_error' in vhost 'x_messenger'.
=ERROR REPORT==== 7-Jul-2015::14:30:49 ===
** Generic server <0.7237.6> terminating
** Last message in was {'$gen_cast',init}
** When Server state == {q,{amqqueue,
{resource,<<"x_messenger">>,queue,
<<"ha.x_messenger.app_x_queue_error">>},
true,false,none,
[{<<"x-dead-letter-exchange">>,longstr,
<<"ha.x_messenger.app_x_queue">>},
{<<"x-message-ttl">>,signedint,300000}],
<0.7237.6>,[],[],[],undefined,[],[],live},
none,false,undefined,undefined,
{state,
{queue,[],[],0},
{active,1436279449206805,1.0}},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],
[],[],[]}}},
delegate},
undefined,undefined,undefined,undefined,0,running}
** Reason for termination ==
** {{case_clause,{value,{<<"count">>,signedint,1}}},
[{rabbit_dead_letter,x_death_event_key,3,[]},
{rabbit_dead_letter,increment_xdeath_event_count,1,[]},
{rabbit_dead_letter,update_x_death_header,2,[]},
{rabbit_dead_letter,'-make_msg/5-fun-2-',8,[]},
{rabbit_basic,map_headers,2,[]},
{rabbit_dead_letter,make_msg,5,[]},
{rabbit_dead_letter,publish,5,[]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',7,[]}]}
We understand what the issue is but not what causes it. Can someone post a small sample that reproduces the issue? We will turn it into a test case.
Sorry, I don't have a sample I'm able to post easily at present. But I can add to the above that it does indeed seem to be the rabbit_dead_letter.erl increment_xdeath_event_count line
case x_death_event_key(Info, <<"count">>, long) of
; when I patched that to disable the attempt to read and increment <<"count">> (which we're not currently using), instead following the undefined branch of the case, the broker freed up and CPU usage went right back down, with existing messages being processed. My apologies for not being able to post a test case code sample.
Peace, tiredpixel
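The workaround described above amounts to relaxing the type check. A hedged sketch (in Python for illustration; not the actual patch) of an increment that accepts any integer-tagged count and normalizes it back to long, which is what the broker itself writes:

```python
# Plausible AMQP 0-9-1 integer type tags; the exact set here is an
# assumption made for this illustration.
INTEGER_TAGS = {"byte", "short", "signedint", "long", "longlong"}

def increment_count_tolerant(info):
    """Treat any integer-tagged count as a number; a missing or
    non-numeric count falls back to the 'undefined' behaviour."""
    entry = info.get("count")
    if entry is not None and entry[0] in INTEGER_TAGS:
        info["count"] = ("long", entry[1] + 1)
    else:
        info["count"] = ("long", 1)

h = {"count": ("signedint", 3)}   # the tag that crashed the broker
increment_count_tolerant(h)
print(h["count"])  # ('long', 4)
```

Under this scheme, the term that previously crashed the queue is simply incremented like any other integer.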
@tiredpixel thanks. A more specific question: do your apps set the count header? Or is it only ever touched by RabbitMQ on dead-lettering?
@michaelklishin, our apps don't set or use the count header at all. It's only ever touched by RabbitMQ on dead-lettering. I noted that <<"count">> seems to be signedint at that point, but I wasn't able to spot why it's that rather than a long. As far as I got (before patching RabbitMQ as above) was that this seems to affect messages which have gone through the dead-letter exchange a few times, by which time <<"count">> appears to be already set; I suspect the [{table, M}] matching branch of update_x_death_header (yielding Info1) is being followed, if that helps.
@tiredpixel thanks. We do have tests that result in a message being dead-lettered multiple times. I have a couple of ideas. 3.5.4 won't be released without a fix for this.
You can provoke the crash of the queue with the Python script found in https://gist.github.com/riyad/7439eb545dabf287bcc9. It requires you to install the Puka library and update the AMQP_URL in the script to point to your RabbitMQ instance.
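For readers who skip the gist, the broker-side setup the script relies on is just a TTL'd queue whose expired messages are dead-lettered onward. A minimal sketch (hypothetical helper names; the toq-gen<N> naming follows the queue names visible in the logs) of the declare arguments a client would pass:

```python
def timeout_queue_args(dead_letter_exchange, ttl_ms):
    """Arguments for declaring a timeout queue: messages expire after
    ttl_ms milliseconds and are dead-lettered to the given exchange."""
    return {
        "x-dead-letter-exchange": dead_letter_exchange,
        "x-message-ttl": ttl_ms,
    }

def timeout_queue_name(generation, base):
    # matches the toq-gen<N>-<base> naming seen in the logs
    return "toq-gen%d-%s" % (generation, base)

print(timeout_queue_name(2, "test.issue_216_crashing_queue"))
# toq-gen2-test.issue_216_crashing_queue
```

Each re-publish into a fresh generation's queue carries the previous x-death header along, which is what eventually triggers the crash.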
@riyad perfect, thank you!
@michaelklishin, understood; thank you for looking into it. :) @riyad, thank you for contributing a test case to this investigation. :)
Peace, tiredpixel
I can reproduce the issue. Looking into it.
FYI: I fixed a small issue with the demo script. It was missing a wait() when publishing the first message. :sweat_smile:
See: https://gist.github.com/riyad/7439eb545dabf287bcc9/revisions
Cheers.
@riyad I have a fix. Does this output confirm correct execution? https://gist.github.com/michaelklishin/97355747f6b85dfe3aae — the exception is gone.
Basically, there should be an additional x-death entry for every time the message is dead-lettered. If there is a message in the queue at the end of the script with all the x-death entries, everything should be OK. The important part is that the message must not be lost.
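The expected bookkeeping can be sketched as follows: each dead-lettering either bumps the count of an existing x-death entry for the same (queue, reason) pair, or prepends a fresh entry, newest first. A hypothetical Python model (plain dicts standing in for AMQP tables):

```python
def record_death(x_death, queue, reason):
    """Update an x-death list in place: increment the count of a
    matching (queue, reason) entry and move it to the front, or
    prepend a new entry with count 1."""
    for i, entry in enumerate(x_death):
        if entry["queue"] == queue and entry["reason"] == reason:
            entry["count"] += 1
            x_death.insert(0, x_death.pop(i))  # newest entry first
            return x_death
    x_death.insert(0, {"queue": queue, "reason": reason, "count": 1})
    return x_death

deaths = []
record_death(deaths, "toq-gen0", "expired")
record_death(deaths, "toq-gen0", "expired")
record_death(deaths, "toq-gen1", "expired")
print(deaths[0]["queue"], deaths[0]["count"])  # toq-gen1 1
print(deaths[1]["queue"], deaths[1]["count"])  # toq-gen0 2
```

The unique per-generation queue names in the repro script mean no (queue, reason) pair ever repeats, so the list grows by one entry per hop, matching the output in the gist.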
I see
rabbitmqctl list_queues
Listing queues ...
test.issue_216_crashing_queue 3
after 3 runs, so it must be working as expected.
@michaelklishin To be sure, it would be nice if you could provide the output of the script run against an empty queue.
λ ~/Development/RabbitMQ/debug/ rabbitmqctl list_queues
Listing queues ...
test.issue_216_crashing_queue 0
λ ~/Development/RabbitMQ/debug/ python server-216.py
connecting ...
setting up ...
publishing ...
fetched message: {'body': 'foo', 'exchange': 'amq.topic', 'promise_number': 7, 'routing_key': 'test.issue_216', 'message_count': 0, 'headers': {'x-puka-delivery-tag': 1}, 'redelivered': False, 'delivery_tag': 1}
declaring timeout queue toq-gen0-test.issue_216_crashing_queue
publishing to toq-gen0-test.issue_216_crashing_queue: {'body': 'foo', 'exchange': 'amq.topic', 'promise_number': 7, 'routing_key': 'test.issue_216', 'message_count': 0, 'headers': {'x-puka-delivery-tag': 1}, 'redelivered': False, 'delivery_tag': 1}
fetched message: {'cluster_id': '', 'empty': True}
fetched message: {'body': 'foo', 'exchange': 'amq.topic', 'promise_number': 11, 'routing_key': 'test.issue_216_crashing_queue', 'message_count': 0, 'headers': {'x-puka-delivery-tag': 2, 'x-death': [{'count': 1L, 'exchange': '', 'time': datetime.datetime(2015, 7, 8, 10, 53, 56), 'queue': 'toq-gen0-test.issue_216_crashing_queue', 'reason': 'expired', 'routing-keys': ['toq-gen0-test.issue_216_crashing_queue']}]}, 'redelivered': False, 'delivery_tag': 2}
declaring timeout queue toq-gen1-test.issue_216_crashing_queue
publishing to toq-gen1-test.issue_216_crashing_queue: {'body': 'foo', 'exchange': 'amq.topic', 'promise_number': 11, 'routing_key': 'test.issue_216_crashing_queue', 'message_count': 0, 'headers': {'x-puka-delivery-tag': 2, 'x-death': [{'count': 1L, 'exchange': '', 'time': datetime.datetime(2015, 7, 8, 10, 53, 56), 'queue': 'toq-gen0-test.issue_216_crashing_queue', 'reason': 'expired', 'routing-keys': ['toq-gen0-test.issue_216_crashing_queue']}]}, 'redelivered': False, 'delivery_tag': 2}
fetched message: {'cluster_id': '', 'empty': True}
fetched message: {'body': 'foo', 'exchange': 'amq.topic', 'promise_number': 15, 'routing_key': 'test.issue_216_crashing_queue', 'message_count': 0, 'headers': {'x-puka-delivery-tag': 3, 'x-death': [{'count': 1L, 'exchange': '', 'time': datetime.datetime(2015, 7, 8, 10, 53, 56), 'queue': 'toq-gen1-test.issue_216_crashing_queue', 'reason': 'expired', 'routing-keys': ['toq-gen1-test.issue_216_crashing_queue']}, {'count': 1, 'exchange': '', 'routing-keys': ['toq-gen0-test.issue_216_crashing_queue'], 'queue': 'toq-gen0-test.issue_216_crashing_queue', 'reason': 'expired', 'time': datetime.datetime(2015, 7, 8, 10, 53, 56)}]}, 'redelivered': False, 'delivery_tag': 3}
declaring timeout queue toq-gen2-test.issue_216_crashing_queue
publishing to toq-gen2-test.issue_216_crashing_queue: {'body': 'foo', 'exchange': 'amq.topic', 'promise_number': 15, 'routing_key': 'test.issue_216_crashing_queue', 'message_count': 0, 'headers': {'x-puka-delivery-tag': 3, 'x-death': [{'count': 1L, 'exchange': '', 'time': datetime.datetime(2015, 7, 8, 10, 53, 56), 'queue': 'toq-gen1-test.issue_216_crashing_queue', 'reason': 'expired', 'routing-keys': ['toq-gen1-test.issue_216_crashing_queue']}, {'count': 1, 'exchange': '', 'routing-keys': ['toq-gen0-test.issue_216_crashing_queue'], 'queue': 'toq-gen0-test.issue_216_crashing_queue', 'reason': 'expired', 'time': datetime.datetime(2015, 7, 8, 10, 53, 56)}]}, 'redelivered': False, 'delivery_tag': 3}
λ ~/Development/RabbitMQ/debug/ rabbitmqctl list_queues
Listing queues ...
test.issue_216_crashing_queue 1
Looks good :)
So the culprit seems to be the fact that the script modifies headers (see the filter_headers function). That's why our existing test suite didn't catch this, and why this is client-specific.
Hmm ... so you're saying Puka is to blame? Am I correct that the issue is that header fields were encoded with the "wrong" type when republishing (even when they're superficially the same)?
Not Puka as such, but I could only trigger this with Puka; it serializes integers differently from the Java client (e.g. the JVM distinguishes between integers and longs, while many dynamically-typed languages use automatic promotion on overflow).
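The divergence can be shown with a toy encoder. A statically-typed client keeps the declared field type, while a dynamically-typed one picks a wire type from the runtime value, so a count RabbitMQ wrote as long can come back tagged differently on republish. This is a hypothetical illustration, not Puka's actual serialization logic:

```python
def pick_integer_tag(value):
    """Hypothetical dynamic-language encoder: choose the narrowest
    integer wire type that fits the runtime value."""
    if -2**31 <= value < 2**31:
        return "signedint"  # fits a 32-bit field
    return "long"           # promote larger values to a wider field

# A count of 1 round-trips with a different tag than the broker wrote,
# which is exactly the shape of the crashing term in the logs above:
print(pick_integer_tag(1))      # signedint
print(pick_integer_tag(2**40))  # long
```

A server-side lookup that matches on the exact tag then sees an unexpected term even though the numeric value is unchanged.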
OK, a decent fix is in #221. We'll release a 3.5.4 RC very soon, and if new edge cases are found, we will add them to Bunny's test suite (one of the client test suites we use).
Thanks a lot for fixing this so quickly. :smile:
Saw the fix. I think you can still check whether the client sent one of AMQP's "integer types". See here for an example: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/blob/master/src/rabbit_delayed_message_utils.erl#L39 — we do the same when we handle queue arguments inside rabbit_amqqueue, I think.
Closed #216 (https://github.com/rabbitmq/rabbitmq-server/issues/216) via 508d76d (https://github.com/rabbitmq/rabbitmq-server/commit/508d76d31984caf91977d2b0fdb095ed829981b3).
@michaelklishin, oh, such a speedy patch—very much obliged to you! :) I'll look forward to 3.5.4 RC. :)
@videlalvaro that would require extracting validators into rabbit_misc or similar. We are under pressure to release 3.5.4 and the clusterer, so we'll do what you suggest for 3.5.5.
I just want to prevent data corruption/future crashes if the user sends something that's not even an int anymore. In that case, if we do some math with the count header, we will have another crash.
If the user modifies x-death events then it counts as a user error. I'm not sure what we can do if "count" is something other than a numerical value. Ignore it entirely?
Reset it to the default?
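Either choice ("ignore" or "reset to default") is straightforward to express defensively. A sketch, assuming headers modeled as (type_tag, value) pairs, of the reset-to-default option:

```python
def normalize_count(entry):
    """If the count value is not an integer (e.g. a client rewrote it
    as a string), fall back to ('long', 1) instead of crashing when
    arithmetic is later done on it."""
    if isinstance(entry, tuple) and len(entry) == 2:
        tag, value = entry
        if isinstance(value, int) and not isinstance(value, bool):
            return (tag, value)
    return ("long", 1)

print(normalize_count(("signedint", 5)))  # ('signedint', 5)
print(normalize_count(("longstr", "5")))  # ('long', 1)
```

The cost of resetting is only that a tampered-with count restarts at 1, which seems acceptable given the header is documented as broker-owned.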
Hi, I am using RabbitMQ 3.5.4 and experiencing the same issue:
=ERROR REPORT==== 14-Mar-2016::20:04:20 ===
** Generic server <0.1073.0> terminating
** Last message in was {'$gen_cast',init}
undefined,undefined,undefined,undefined,0,running}
** Reason for termination ==
** {{case_clause,{<<"x-death">>,longstr, <<"[{reason=expired,
Can you please suggest how I can delete the queue? I am not able to delete this queue from the admin console.
Restarting the node should make it possible to delete the queue before it hits the same exception again. The right thing to do is to upgrade.
Hello. I'm using v3.6.5 (also tested from 3.5.6) with retry queues (task => retry1 (dl/ttl) => task => retry2 (dl/ttl) => task => retry3 (dl/ttl) => dl). Dead-lettering the second time causes this error:
=ERROR REPORT==== 14-Nov-2016::11:09:53 ===
** Generic server <0.932.0> terminating
** Last message in was {drop_expired,1}
...
=CRASH REPORT==== 14-Nov-2016::11:09:53 ===
crasher:
initial call: rabbit_prequeue:init/1
pid: <0.932.0>
registered_name: []
exception exit: {function_clause,
[{rabbit_dead_letter,
'-queue_and_reason_matcher/2-fun-1-',
[{array,
[{array,[{longstr,<<"l">>},{longstr,<<"1">>}]},
{array,[{longstr,<<"S">>},{longstr,<<"expired">>}]},
{array,
[{longstr,<<"S">>},
{longstr,
<<"mytask_retry_1">>}]},
{array,
[{longstr,<<"T">>},{longstr,<<"1479121758">>}]},
{array,[{longstr,<<"S">>},{longstr,<<"retry">>}]},
{array,
[{longstr,<<"A">>},
{array,
[{longstr,
<<"mytask_retry_1">>}]}]}]}],
[{file,"src/rabbit_dead_letter.erl"},{line,178}]},
{lists,partition,4,[{file,"lists.erl"},{line,1302}]},
{rabbit_dead_letter,update_x_death_header,2,
[{file,"src/rabbit_dead_letter.erl"},{line,127}]},
{rabbit_dead_letter,'-make_msg/5-fun-2-',8,
[{file,"src/rabbit_dead_letter.erl"},{line,65}]},
{rabbit_basic,map_headers,2,
[{file,"src/rabbit_basic.erl"},{line,258}]},
{rabbit_dead_letter,make_msg,5,
[{file,"src/rabbit_dead_letter.erl"},{line,68}]},
{rabbit_dead_letter,publish,5,
[{file,"src/rabbit_dead_letter.erl"},{line,34}]},
{rabbit_amqqueue_process,'-dead_letter_msgs/4-fun-0-',
7,
[{file,"src/rabbit_amqqueue_process.erl"},
{line,848}]}]}
in function gen_server2:terminate/3 (src/gen_server2.erl, line 1143)
ancestors: [<0.931.0>,rabbit_amqqueue_sup_sup,rabbit_sup,<0.140.0>]
messages: []
links: [<0.931.0>]
dictionary: [{{credit_to,<0.1089.0>},49},
{{xtype_to_module,direct},rabbit_exchange_type_direct},
{process_name,
{rabbit_amqqueue_process,
{resource,<<"myvhost">>,queue,
<<"mytask_retry_2">>}}},
{rand_seed,
{#{max => 288230376151711743,
next => #Fun<rand.8.41921595>,
type => exsplus,
uniform => #Fun<rand.9.41921595>,
uniform_n => #Fun<rand.10.41921595>},
[239153173037508270|193202472509615284]}},
{guid,{{2457094499,2660964459,2278303215,3311440446},1}},
{credit_flow_default_credit,{200,50}}]
trap_exit: true
status: running
heap_size: 2586
stack_size: 27
reductions: 4190
neighbours:
...
Any way to make it work?
@pdoreau please post a code snippet that reproduces it to rabbitmq-users, our public mailing list.
@michaelklishin I posted a message on the users list but it doesn't appear. It may not be so useful to post a snippet, as I rely on several layers (Swarrot, amqplib). However, I noticed nested arrays are given to amqplib when publishing the retry2 message:
...
["x-death"]=>
array(2) {
[0]=>
string(1) "A"
[1]=>
array(1) {
[0]=>
array(6) {
["count"]=>
array(2) {
[0]=>
string(1) "l"
[1]=>
string(1) "1"
}
["reason"]=>
array(2) {
[0]=>
string(1) "S"
[1]=>
string(7) "expired"
}
["queue"]=>
array(2) {
[0]=>
string(1) "S"
[1]=>
string(36) "mytask_retry_1"
}
["time"]=>
array(2) {
[0]=>
string(1) "T"
[1]=>
string(10) "1479142401"
}
["exchange"]=>
array(2) {
[0]=>
string(1) "S"
[1]=>
string(5) "retry"
}
["routing-keys"]=>
array(2) {
[0]=>
string(1) "A"
[1]=>
array(1) {
[0]=>
string(36) "mytask_retry_1"
}
}
}
}
}
...
Can that be the cause?
They were in the moderation queue, I approved them.
The x-death value is an array of objects (in JSON terms), so we need to compare what your app is sending. It is certainly possible for apps to confuse RabbitMQ by sending invalidly modified headers; we have seen this several times before.
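A client-side sanity check can catch accidental mutation before republishing. A hedged sketch (a heuristic, not an official schema) that verifies an x-death value still looks like what RabbitMQ writes: a list of dict-like tables carrying the broker-written keys:

```python
# Keys RabbitMQ writes into each x-death entry, per the logs above.
REQUIRED_KEYS = {"count", "reason", "queue", "time", "exchange",
                 "routing-keys"}

def looks_like_x_death(value):
    """Heuristic shape check: x-death should be a list of dict-like
    tables, each with the expected broker-written keys."""
    if not isinstance(value, list):
        return False
    return all(
        isinstance(entry, dict) and REQUIRED_KEYS <= set(entry)
        for entry in value
    )

good = [{"count": 1, "reason": "expired", "queue": "q", "time": 0,
         "exchange": "", "routing-keys": ["q"]}]
bad = [["l", "1"]]  # nested type-tag pairs, as in the PHP dump above
print(looks_like_x_death(good))  # True
print(looks_like_x_death(bad))   # False
```

The "bad" shape is what the nested-array dump above would deserialize to: the per-field type tags leaked into the value instead of staying at the wire level.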
Here's the rabbitmq-users thread, and so far it looks like the client, or one of the libraries on top, unintentionally modifies the x-death header, which is meant to be modified only by RabbitMQ itself. In any case, this is not the same issue.
We have a similar setup to #161 where we publish messages into (unique) timeout queues that drop the messages back into their original queues for retrying.
Our setup started to lose messages after we updated to RabbitMQ 3.5.3. Looking at the logs we found this:
We were able to reproduce this every time a message was about to drop out of the timeout queue (toq) for the third time (hence the "gen2" part of the queue name in the logs).
Full log (names of queues and exchanges redacted):