rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/
Other
12.28k stars 3.91k forks source link

Stream consumers do not trigger unexpected_chunk_id exceptions on the broker #3359

Closed gerhard closed 3 years ago

gerhard commented 3 years ago

Originally posted by @acogoluegnes in https://github.com/rabbitmq/opportunities/issues/99#issuecomment-910000375 (private repository)

I reproduced a couple of issues on a local node (not a cluster this time). It takes many more attempts to reproduce than with a cluster.

make run-broker PLUGINS='rabbitmq_stream_management'

I also increased the rate of the publishers:

stream-perf-test --streams stream-max-length --max-length-bytes 250MB \
              --consumers 0 \
              --producers 2 \
              --rate 6000 --size 4000

Consumer:

export RABBITMQ_STREAM_PERF_TEST_LOGGERS=com.rabbitmq.stream=debug
stream-perf-test --streams stream-max-length --max-length-bytes 250MB \
              --consumers 1 \
              --producers 0

I got the same unexpected_chunk_id error:

2021-09-01 09:04:19.544241+02:00 [debug] <0.2028.0> Open frame received for /
2021-09-01 09:04:19.544367+02:00 [debug] <0.2028.0> sending open response ok /
2021-09-01 09:04:19.544476+02:00 [debug] <0.2028.0> Transitioned from tuned to opened
2021-09-01 09:04:19.547956+02:00 [debug] <0.2028.0> Creating subscription 0 to <<"stream-max-length">>, with offset specification next, properties #{}
2021-09-01 09:04:19.548103+02:00 [debug] <0.2028.0> osiris: initialising reader. Spec: next
2021-09-01 09:04:19.549247+02:00 [debug] <0.2028.0> osiris_log: segment 00000000000002742167.segment has trailing data 409867736 410124552
2021-09-01 09:04:19.549334+02:00 [debug] <0.2028.0> osiris_log:build_log_overview/1 completed in 0.001086s
2021-09-01 09:04:19.549406+02:00 [debug] <0.2028.0> osiris_log:init_offset_reader/2 spec next range {2742167,2844310}
2021-09-01 09:04:19.549591+02:00 [debug] <0.2028.0> osiris_log:scan_idx/2 completed in 0.000105s
2021-09-01 09:04:19.549659+02:00 [debug] <0.2028.0> Next offset for subscription 0 is 2844311
2021-09-01 09:04:19.549720+02:00 [debug] <0.2028.0> Distributing existing messages to subscription 0
2021-09-01 09:04:19.551240+02:00 [debug] <0.2028.0> rabbit_stream_reader terminating in state 'open' with reason '{badmatch,
2021-09-01 09:04:19.551240+02:00 [debug] <0.2028.0>                                                                {error,
2021-09-01 09:04:19.551240+02:00 [debug] <0.2028.0>                                                                 {unexpected_chunk_id,
2021-09-01 09:04:19.551240+02:00 [debug] <0.2028.0>                                                                  2844375,
2021-09-01 09:04:19.551240+02:00 [debug] <0.2028.0>                                                                  2844311}}}'
2021-09-01 09:04:19.551566+02:00 [debug] <0.2034.0> Closing all channels from connection '127.0.0.1:49188 -> 127.0.1.1:5552' because it has been closed
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> ** State machine <0.2028.0> terminating
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> ** Last event = {info,{tcp,#Port<0.255>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            <<0,0,0,32,0,7,0,1,0,0,0,4,0,0,17,115,116,114,101,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                              97,109,45,109,97,120,45,108,101,110,103,116,104,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                              0,3,0,10>>}}
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> ** When server state  = {open,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                          {statem_data,ranch_tcp,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                           {stream_connection,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            <<"127.0.0.1:49188 -> 127.0.1.1:5552">>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            {0,0,0,0,0,65535,32512,257},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            {0,0,0,0,0,65535,32512,1},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            5552,49188,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            {<<"PLAIN">>,rabbit_auth_mechanism_plain},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            1630479859525,<0.2027.0>,#Port<0.255>,#{},#{},#{},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            #{},#Ref<0.2392243111.581566465.119749>,done,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            {user,<<"guest">>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                             [administrator],
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                             [{rabbit_auth_backend_internal,none}]},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            <<"/">>,opened,1048576,60,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            {<0.2029.0>,<0.2030.0>},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            #{<<"connection_name">> =>
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                               <<"rabbitmq-stream-consumer">>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                              <<"copyright">> =>
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                               <<"Copyright (c) 2020-2021 VMware, Inc. or its affiliates.">>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                              <<"information">> =>
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                               <<"Licensed under the MPL 2.0. See https://www.rabbitmq.com/">>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                              <<"platform">> => <<"Java">>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                              <<"product">> => <<"RabbitMQ Stream">>,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                              <<"version">> => <<"0.4.0-SNAPSHOT">>},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            #{},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            {state,fine,5000,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                             #Ref<0.2392243111.581435394.107508>},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            false,#Ref<0.2392243111.581566465.119750>,tcp},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                           {stream_connection_state,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            {rabbit_stream_core,{cfg},[],undefined,{[],[]}},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            false,#{}},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                           {configuration,50000,12500,1048576,60,10000}}}
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> ** Reason for termination = error:{badmatch,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                                    {error,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                                     {unexpected_chunk_id,2844375,2844311}}}
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> ** Callback modules = [rabbit_stream_reader]
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> ** Callback mode = [state_functions,state_enter]
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> ** Stacktrace =
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> **  [{rabbit_stream_reader,handle_frame_post_auth,4,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            [{file,"src/rabbit_stream_reader.erl"},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                             {line,1742}]},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>      {lists,foldl,3,[{file,"lists.erl"},{line,1267}]},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>      {rabbit_stream_reader,open,3,
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>                            [{file,"src/rabbit_stream_reader.erl"},{line,651}]},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>      {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]},
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
2021-09-01 09:04:19.551388+02:00 [error] <0.2028.0> 
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>   crasher:
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     initial call: rabbit_stream_reader:init/1
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     pid: <0.2028.0>
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     registered_name: []
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     exception error: no match of right hand side value
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>                      {error,{unexpected_chunk_id,2844375,2844311}}
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>       in function  rabbit_stream_reader:handle_frame_post_auth/4 (src/rabbit_stream_reader.erl, line 1742)
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>       in call from lists:foldl/3 (lists.erl, line 1267)
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>       in call from rabbit_stream_reader:open/3 (src/rabbit_stream_reader.erl, line 651)
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>       in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1166)
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     ancestors: [<0.2026.0>,<0.793.0>,<0.792.0>,<0.791.0>,<0.789.0>,
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>                   <0.788.0>,rabbit_stream_sup,<0.784.0>]
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     message_queue_len: 0
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     messages: []
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     links: [#Port<0.255>,<0.2026.0>]
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     dictionary: [{permission_cache,
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>                       [{{resource,<<"/">>,queue,<<"stream-max-length">>},
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>                         #{},read}]}]
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     trap_exit: true
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     status: running
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     heap_size: 17731
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     stack_size: 28
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>     reductions: 93371
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0>   neighbours:
2021-09-01 09:04:19.552376+02:00 [error] <0.2028.0> 
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>     supervisor: {<0.2026.0>,rabbit_stream_connection_sup}
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>     errorContext: child_terminated
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>     reason: {{badmatch,{error,{unexpected_chunk_id,2844375,2844311}}},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>              [{rabbit_stream_reader,handle_frame_post_auth,4,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                                     [{file,"src/rabbit_stream_reader.erl"},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                                      {line,1742}]},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>               {lists,foldl,3,[{file,"lists.erl"},{line,1267}]},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>               {rabbit_stream_reader,open,3,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                                     [{file,"src/rabbit_stream_reader.erl"},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                                      {line,651}]},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>               {gen_statem,loop_state_callback,11,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                           [{file,"gen_statem.erl"},{line,1166}]},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>               {proc_lib,init_p_do_apply,3,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                         [{file,"proc_lib.erl"},{line,226}]}]}
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>     offender: [{pid,<0.2028.0>},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                {id,rabbit_stream_reader},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                {mfargs,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                    {rabbit_stream_reader,start_link,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                        [<0.2027.0>,ranch_tcp,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                         {acceptor,{0,0,0,0,0,0,0,0},5552},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                         #{credits_required_for_unblocking => 12500,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                           frame_max => 1048576,heartbeat => 60,
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                           initial_credits => 50000,transport => tcp}]}},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                {restart_type,intrinsic},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                {shutdown,30000},
2021-09-01 09:04:19.553022+02:00 [error] <0.2026.0>                {child_type,worker}]
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>     supervisor: {<0.2026.0>,rabbit_stream_connection_sup}
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>     errorContext: shutdown
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>     reason: reached_max_restart_intensity
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>     offender: [{pid,<0.2028.0>},
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                {id,rabbit_stream_reader},
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                {mfargs,
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                    {rabbit_stream_reader,start_link,
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                        [<0.2027.0>,ranch_tcp,
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                         {acceptor,{0,0,0,0,0,0,0,0},5552},
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                         #{credits_required_for_unblocking => 12500,
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                           frame_max => 1048576,heartbeat => 60,
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                           initial_credits => 50000,transport => tcp}]}},
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                {restart_type,intrinsic},
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                {shutdown,30000},
2021-09-01 09:04:19.553407+02:00 [error] <0.2026.0>                {child_type,worker}]
2021-09-01 09:04:21.796442+02:00 [debug] <0.868.0> osiris_log: open_new_segment : 00000000000002866801.segment
2021-09-01 09:04:21.798071+02:00 [debug] <0.221.0> osiris_log:build_log_overview/1 completed in 0.000551s
2021-09-01 09:04:21.798185+02:00 [debug] <0.221.0> osiris_log: deleting segment 00000000000002742167.segment in /tmp/rabbitmq-test-instances/rabbit@acogoluegnes-nuc/mnesia/rabbit@acogoluegnes-nuc/stream/__stream-max-length_1630479402265057554

I got also another one, invalid_chunk_header:

2021-09-01 08:57:11.895972+02:00 [debug] <0.961.0> Open frame received for /
2021-09-01 08:57:11.896077+02:00 [debug] <0.961.0> sending open response ok /
2021-09-01 08:57:11.896158+02:00 [debug] <0.961.0> Transitioned from tuned to opened
2021-09-01 08:57:11.897813+02:00 [debug] <0.961.0> Creating subscription 0 to <<"stream-max-length">>, with offset specification next, properties #{}
2021-09-01 08:57:11.897893+02:00 [debug] <0.961.0> osiris: initialising reader. Spec: next
2021-09-01 08:57:11.898652+02:00 [debug] <0.961.0> osiris_log:build_log_overview/1 completed in 0.000673s
2021-09-01 08:57:11.898727+02:00 [debug] <0.961.0> osiris_log:init_offset_reader/2 spec next range {124593,177291}
2021-09-01 08:57:11.898880+02:00 [debug] <0.961.0> osiris_log:scan_idx/2 completed in 0.000074s
2021-09-01 08:57:11.898965+02:00 [debug] <0.961.0> Next offset for subscription 0 is 177292
2021-09-01 08:57:11.899052+02:00 [debug] <0.961.0> Distributing existing messages to subscription 0
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0> rabbit_stream_reader terminating in state 'open' with reason '{badmatch,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                {error,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                 {invalid_chunk_header,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                  {ok,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                   <<0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0,
2021-09-01 08:57:11.899234+02:00 [debug] <0.961.0>                                                                     0,0,0,0>>}}}}'
2021-09-01 08:57:11.899480+02:00 [debug] <0.967.0> Closing all channels from connection '127.0.0.1:49038 -> 127.0.1.1:5552' because it has been closed
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> ** State machine <0.961.0> terminating
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> ** Last event = {info,{tcp,#Port<0.68>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            <<0,0,0,32,0,7,0,1,0,0,0,4,0,0,17,115,116,114,101,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                              97,109,45,109,97,120,45,108,101,110,103,116,104,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                              0,3,0,10>>}}
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> ** When server state  = {open,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                          {statem_data,ranch_tcp,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                           {stream_connection,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            <<"127.0.0.1:49038 -> 127.0.1.1:5552">>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            {0,0,0,0,0,65535,32512,257},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            {0,0,0,0,0,65535,32512,1},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            5552,49038,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            {<<"PLAIN">>,rabbit_auth_mechanism_plain},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            1630479431878,<0.960.0>,#Port<0.68>,#{},#{},#{},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            #{},#Ref<0.2392243111.581566466.103515>,done,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            {user,<<"guest">>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                             [administrator],
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                             [{rabbit_auth_backend_internal,none}]},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            <<"/">>,opened,1048576,60,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            {<0.962.0>,<0.963.0>},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            #{<<"connection_name">> =>
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                               <<"rabbitmq-stream-consumer">>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                              <<"copyright">> =>
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                               <<"Copyright (c) 2020-2021 VMware, Inc. or its affiliates.">>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                              <<"information">> =>
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                               <<"Licensed under the MPL 2.0. See https://www.rabbitmq.com/">>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                              <<"platform">> => <<"Java">>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                              <<"product">> => <<"RabbitMQ Stream">>,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                              <<"version">> => <<"0.4.0-SNAPSHOT">>},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            #{},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            {state,fine,5000,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                             #Ref<0.2392243111.581435393.107534>},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            false,#Ref<0.2392243111.581566466.103516>,tcp},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                           {stream_connection_state,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            {rabbit_stream_core,{cfg},[],undefined,{[],[]}},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            false,#{}},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                           {configuration,50000,12500,1048576,60,10000}}}
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> ** Reason for termination = error:{badmatch,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                                    {error,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                                     {invalid_chunk_header,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                                      {ok,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                                       <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                                         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                                         0,0,0,0,0,0,0,0,0,0,0,0,0,0>>}}}}
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> ** Callback modules = [rabbit_stream_reader]
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> ** Callback mode = [state_functions,state_enter]
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> ** Stacktrace =
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> **  [{rabbit_stream_reader,handle_frame_post_auth,4,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            [{file,"src/rabbit_stream_reader.erl"},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                             {line,1742}]},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>      {lists,foldl,3,[{file,"lists.erl"},{line,1267}]},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>      {rabbit_stream_reader,open,3,
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>                            [{file,"src/rabbit_stream_reader.erl"},{line,651}]},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>      {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1166}]},
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]
2021-09-01 08:57:11.899562+02:00 [error] <0.961.0> 
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>   crasher:
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     initial call: rabbit_stream_reader:init/1
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     pid: <0.961.0>
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     registered_name: []
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     exception error: no match of right hand side value
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                      {error,
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                          {invalid_chunk_header,
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                              {ok,<<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                                    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                                    0,0,0,0,0,0,0,0>>}}}
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>       in function  rabbit_stream_reader:handle_frame_post_auth/4 (src/rabbit_stream_reader.erl, line 1742)
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>       in call from lists:foldl/3 (lists.erl, line 1267)
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>       in call from rabbit_stream_reader:open/3 (src/rabbit_stream_reader.erl, line 651)
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>       in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1166)
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     ancestors: [<0.959.0>,<0.793.0>,<0.792.0>,<0.791.0>,<0.789.0>,<0.788.0>,
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                   rabbit_stream_sup,<0.784.0>]
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     message_queue_len: 0
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     messages: []
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     links: [#Port<0.68>,<0.959.0>]
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     dictionary: [{permission_cache,
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                       [{{resource,<<"/">>,queue,<<"stream-max-length">>},
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>                         #{},read}]}]
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     trap_exit: true
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     status: running
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     heap_size: 28690
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     stack_size: 28
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>     reductions: 100963
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0>   neighbours:
2021-09-01 08:57:11.900387+02:00 [error] <0.961.0> 
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>     supervisor: {<0.959.0>,rabbit_stream_connection_sup}
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>     errorContext: child_terminated
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>     reason: {{badmatch,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                  {error,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                      {invalid_chunk_header,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                          {ok,<<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                                0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                                0,0,0,0>>}}}},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>              [{rabbit_stream_reader,handle_frame_post_auth,4,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                   [{file,"src/rabbit_stream_reader.erl"},{line,1742}]},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>               {lists,foldl,3,[{file,"lists.erl"},{line,1267}]},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>               {rabbit_stream_reader,open,3,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                   [{file,"src/rabbit_stream_reader.erl"},{line,651}]},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>               {gen_statem,loop_state_callback,11,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                   [{file,"gen_statem.erl"},{line,1166}]},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>               {proc_lib,init_p_do_apply,3,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                   [{file,"proc_lib.erl"},{line,226}]}]}
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>     offender: [{pid,<0.961.0>},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                {id,rabbit_stream_reader},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                {mfargs,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                    {rabbit_stream_reader,start_link,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                        [<0.960.0>,ranch_tcp,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                         {acceptor,{0,0,0,0,0,0,0,0},5552},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                         #{credits_required_for_unblocking => 12500,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                           frame_max => 1048576,heartbeat => 60,
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                           initial_credits => 50000,transport => tcp}]}},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                {restart_type,intrinsic},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                {shutdown,30000},
2021-09-01 08:57:11.900942+02:00 [error] <0.959.0>                {child_type,worker}]
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>     supervisor: {<0.959.0>,rabbit_stream_connection_sup}
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>     errorContext: shutdown
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>     reason: reached_max_restart_intensity
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>     offender: [{pid,<0.961.0>},
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                {id,rabbit_stream_reader},
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                {mfargs,
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                    {rabbit_stream_reader,start_link,
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                        [<0.960.0>,ranch_tcp,
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                         {acceptor,{0,0,0,0,0,0,0,0},5552},
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                         #{credits_required_for_unblocking => 12500,
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                           frame_max => 1048576,heartbeat => 60,
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                           initial_credits => 50000,transport => tcp}]}},
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                {restart_type,intrinsic},
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                {shutdown,30000},
2021-09-01 08:57:11.901367+02:00 [error] <0.959.0>                {child_type,worker}]
2021-09-01 08:57:16.942323+02:00 [info] <0.976.0> Supervisor {<0.976.0>,rabbit_stream_connection_sup}: child rabbit_stream_keepalive_sup started (<0.977.0>): {rabbit_stream_connection_sup,start_keepalive_link,[]}

There is more detail here: https://github.com/rabbitmq/opportunities/issues/99#issuecomment-910043659

gerhard commented 3 years ago

Thanks for the tips @kjnilsson !

I’ve got a feeling the only correct / practical way to resolve next is by resolving last then immediately read and throw away the last chunk. That should position next as expected. I’m really not sure how it works at all atm.

Also remember this error has also been observed with last.

mkuratczyk commented 3 years ago

I believe this is what happens:

  1. When next is requested, scan_idx returns the expected next chunk ID (last existing + 1) and eof as its (expected) location
  2. The problem is that going to the eof is not idempotent - if you go to the eof position, you may end up in different places every time, if the file is changing (which it is when publishers are running)
  3. That's exactly what happens between the moment scan_idx returns and the file descriptor is positioned in the location returned by scan_idx. If a new chunk is appended to the segment, eof location in the segment file returned by scan_idx is not the same location that is used to position the file descriptor. That's why we read an unexpected chunk.

I see a few options:

I personally like the second option more because it solves another problem that I think we have - I don't think ChunkIds are guaranteed to be perfectly consecutive (if they are, ignore this issue). scan_idx currently assumes that the next ChunkId will be exactly the last one + 1 (last existing chunk start + number of entries in that chunk). Even if scan_idx returned the location in the segment as an integer, I believe it could happen that the ChunkId at that location won't be the expected ChunkId. That would lead to the same error, although much less often than what we see right now.

Lastly, I haven't tested this but I wonder what happens if the last existing chunk in the segment is actually the last one that will be in that segment. There could be another race condition where scan_idx suggests the next chunk will be in the same segment file, just after the currently last chunk, but in reality the next chunk will be written to a new segment file.

kjnilsson commented 3 years ago

Ok so this PR is merged https://github.com/rabbitmq/osiris/pull/50

@acogoluegnes has done some local testing and could not repro this issue

kjnilsson commented 3 years ago

@Gsantomaggio and I have been testing this on a 3 node cluster in the opportunity-99 env and cannot generate any errors including this even with very targeted consumer apps that periodically crash and re-attach using a random offset specification. We're happy that this issue is fixed / substantially improved.

gerhard commented 3 years ago

Has https://github.com/rabbitmq/osiris/pull/50 been back ported to v3.9.x? If yes, we should add it to the release notes of the version that we expect it to ship in, e.g. https://github.com/rabbitmq/rabbitmq-server/blob/master/release-notes/3.9.6.md

kjnilsson commented 3 years ago

That's a good point. I'm guessing we need a new tag for it and update 3.9.x accordingly?

kjnilsson commented 3 years ago

https://github.com/rabbitmq/osiris/releases/tag/v1.2.0

new tag

kjnilsson commented 3 years ago

https://github.com/rabbitmq/rabbitmq-server/pull/3417