rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Upgrading a cluster with quorum queues to OTP26 fails with checksum mismatch #8057

Status: Closed (mkuratczyk closed 1 year ago)

mkuratczyk commented 1 year ago

Describe the bug

NOTE: RabbitMQ does not support OTP26 yet. This issue should not affect any users.

When performing a rolling upgrade from OTP25 to OTP26, the first node running OTP26 to re-join the cluster will not be able to accept Ra snapshots:

** Reason for termination = error:{badmatch,1522447362}
** Callback modules = [ra_server_proc]
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{ra_log_snapshot,complete_accept,2,
                      [{file,"ra_log_snapshot.erl"},{line,85}]},
     {ra_snapshot,accept_chunk,4,[{file,"ra_snapshot.erl"},{line,281}]},
     {ra_server,handle_receive_snapshot,2,
                [{file,"ra_server.erl"},{line,1227}]},
     {ra_server_proc,handle_receive_snapshot,2,
                     [{file,"ra_server_proc.erl"},{line,1052}]},
     {ra_server_proc,receive_snapshot,3,
                     [{file,"ra_server_proc.erl"},{line,805}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1377}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]
** Time-outs: {2,
               [{state_timeout,receive_snapshot_timeout},
                {{timeout,tick},tick_timeout}]}
** Client <18278.7917.0> is remote on node 'rabbit@qq-s1000-server-0.qq-s1000-nodes.qq'

  crasher:
    initial call: ra_server_proc:init/1
    pid: <0.632.0>
    registered_name: '%2F_fivers-8'
    exception error: no match of right hand side value 1522447362
      in function  ra_log_snapshot:complete_accept/2 (ra_log_snapshot.erl, line 85)
      in call from ra_snapshot:accept_chunk/4 (ra_snapshot.erl, line 281)
      in call from ra_server:handle_receive_snapshot/2 (ra_server.erl, line 1227)
      in call from ra_server_proc:handle_receive_snapshot/2 (ra_server_proc.erl, line 1052)
      in call from ra_server_proc:receive_snapshot/3 (ra_server_proc.erl, line 805)
      in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1377)

Full log file: upgrade.log.gz

Reproduction steps

  1. Deploy RabbitMQ with OTP25 (it can be the main branch or 3.11, I will use main)
  2. Deploy some quorum queue workload (I'm using perf-test -x 10 -y 10 -r 5000 -c 500 -qp fivers-%d -qpf 1 -qpt 10 -qa x-max-length=1000000)
  3. Perform a rolling upgrade to OTP26

Expected behavior

Successful upgrade :)

mkuratczyk commented 1 year ago

I've put together a script that reproduces this locally:

#!/bin/bash

# start a 3-node cluster with OTP25
source ~/.kerl/25.3.1/activate
bazel clean
bazel run start-cluster

# stop rabbit-0
rabbitmqctl -n rabbit-0 shutdown

# publish some messages
java -jar perf-test-dev.jar -H amqp://localhost:5673 -qq -u qq -c 500 -ms -z 30

# start rabbit-0 on OTP26
source ~/.kerl/26.0-rc3/activate
bazel run start-cluster NODES=1

mkuratczyk commented 1 year ago

The problem is caused by a change in map ordering in OTP26. Ra snapshot metadata is a map that is serialized with term_to_binary, and the resulting binary is part of the data the checksum is calculated over. Because OTP26 writes the elements of the map in a different order, the serialized bytes differ, and so does the checksum. The same term evaluated on both versions shows the difference:

OTP 25.3.1

1> Meta = #{cluster => [{'%2F_qq','rabbit-0@mkuratczykPF0JR'}, {'%2F_qq','rabbit-1@mkuratczykPF0JR'}, {'%2F_qq','rabbit-2@mkuratczykPF0JR'}], index => 611960,machine_version => 3,term => 1}.
#{index => 611960,term => 1,
  cluster =>
      [{'%2F_qq','rabbit-0@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-1@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-2@mkuratczykPF0JR'}],
  machine_version => 3}
2> MetaBin = erlang:term_to_binary(Meta).
<<131,116,0,0,0,4,119,5,105,110,100,101,120,98,0,9,86,120,
  119,4,116,101,114,109,97,1,119,7,99,...>>
3> erlang:crc32(MetaBin).
2066562623

OTP 26.0-rc3

1> Meta = #{cluster => [{'%2F_qq','rabbit-0@mkuratczykPF0JR'}, {'%2F_qq','rabbit-1@mkuratczykPF0JR'}, {'%2F_qq','rabbit-2@mkuratczykPF0JR'}], index => 611960,machine_version => 3,term => 1}.
#{cluster =>
      [{'%2F_qq','rabbit-0@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-1@mkuratczykPF0JR'},
       {'%2F_qq','rabbit-2@mkuratczykPF0JR'}],
  index => 611960,machine_version => 3,term => 1}
2> MetaBin = erlang:term_to_binary(Meta).
<<131,116,0,0,0,4,100,0,7,99,108,117,115,116,101,114,108,
  0,0,0,3,104,2,100,0,6,37,50,70,...>>
3> erlang:crc32(MetaBin).
3828560182
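
To illustrate why key order matters here, the following is a minimal sketch and not the fix applied in Ra or RabbitMQ. The module name meta_checksum and both function names are hypothetical. It contrasts a checksum over the serialized map with one over a key-sorted list of pairs, which removes the dependency on the map's internal ordering. It is an assumption that ordering is the only difference between the two encodings; if other encoding details (such as how atoms are encoded) also differ between releases, sorting alone would not make the checksums agree.

```erlang
-module(meta_checksum).
-export([naive/1, canonical/1]).

%% Checksum over the map's external term format. The order of the encoded
%% key/value pairs follows the VM's internal map order, which is not
%% guaranteed to be the same across OTP releases.
naive(Meta) when is_map(Meta) ->
    erlang:crc32(erlang:term_to_binary(Meta)).

%% Checksum over a key-sorted list of pairs. The element order is fixed by
%% Erlang term order on the keys rather than by the map implementation.
canonical(Meta) when is_map(Meta) ->
    Pairs = lists:keysort(1, maps:to_list(Meta)),
    erlang:crc32(erlang:term_to_binary(Pairs)).
```

Within a single node, meta_checksum:canonical/1 returns the same value for equal maps regardless of how they were built; whether it also matches across OTP25 and OTP26 would need to be verified on both releases.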

michaelklishin commented 1 year ago

https://github.com/rabbitmq/rabbitmq-server/pull/8143 makes rolling upgrades to Erlang 26 succeed under a constant load involving quorum queues.