Closed mkuratczyk closed 1 year ago
I've put together a script that reproduces this locally:
#!/bin/bash
# start a 3-node cluster with OTP25
source ~/.kerl/25.3.1/activate
bazel clean
bazel run start-cluster
# stop rabbit-0
rabbitmqctl -n rabbit-0 shutdown
# publish some messages
java -jar perf-test-dev.jar -H amqp://localhost:5673 -qq -u qq -c 500 -ms -z 30
# start rabbit-0 on OTP26
source ~/.kerl/26.0-rc3/activate
bazel run start-cluster NODES=1
The problem is caused by a change in map key ordering in OTP26. Ra snapshot metadata is a map that is serialized with term_to_binary, and the resulting binary is part of the data the snapshot checksum is calculated over.
Because OTP26 orders the keys differently, the map elements are written in a different order, which produces a different checksum for the same metadata.
OTP 26.0-rc3
1> Meta = #{cluster => [{'%2F_qq','rabbit-0@mkuratczykPF0JR'}, {'%2F_qq','rabbit-1@mkuratczykPF0JR'}, {'%2F_qq','rabbit-2@mkuratczykPF0JR'}], index => 611960,machine_version => 3,term => 1}.
#{index => 611960,term => 1,
cluster =>
[{'%2F_qq','rabbit-0@mkuratczykPF0JR'},
{'%2F_qq','rabbit-1@mkuratczykPF0JR'},
{'%2F_qq','rabbit-2@mkuratczykPF0JR'}],
machine_version => 3}
2> MetaBin = erlang:term_to_binary(Meta).
<<131,116,0,0,0,4,119,5,105,110,100,101,120,98,0,9,86,120,
119,4,116,101,114,109,97,1,119,7,99,...>>
3> erlang:crc32(MetaBin).
2066562623
OTP 25.3.1
1> Meta = #{cluster => [{'%2F_qq','rabbit-0@mkuratczykPF0JR'}, {'%2F_qq','rabbit-1@mkuratczykPF0JR'}, {'%2F_qq','rabbit-2@mkuratczykPF0JR'}], index => 611960,machine_version => 3,term => 1}.
#{cluster =>
[{'%2F_qq','rabbit-0@mkuratczykPF0JR'},
{'%2F_qq','rabbit-1@mkuratczykPF0JR'},
{'%2F_qq','rabbit-2@mkuratczykPF0JR'}],
index => 611960,machine_version => 3,term => 1}
2> MetaBin = erlang:term_to_binary(Meta).
<<131,116,0,0,0,4,100,0,7,99,108,117,115,116,101,114,108,
0,0,0,3,104,2,100,0,6,37,50,70,...>>
3> erlang:crc32(MetaBin).
3828560182
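The dependence on pair order can be isolated on a single OTP release by encoding the same key/value pairs as lists in two different orders. The sketch below is only an illustration: the module name `meta_checksum` and the functions `naive/1`, `sorted/1` and `demo/0` are hypothetical and are not taken from Ra or from the linked PR. `sorted/1` shows one possible normalization, flattening the map to a key-sorted list before encoding, so that the byte order no longer follows the runtime's internal map order. Note that the two transcripts above also use different atom tags (100 vs 119), so key order may not be the only byte-level difference between releases, and this sketch alone should not be assumed to make the checksums match.

```erlang
-module(meta_checksum).
-export([naive/1, sorted/1, demo/0]).

%% Checksum over the map exactly as term_to_binary/1 emits it.
%% The order of the encoded pairs follows the runtime's map order.
naive(Meta) when is_map(Meta) ->
    erlang:crc32(erlang:term_to_binary(Meta)).

%% Hypothetical normalization (an assumption, not Ra's actual fix):
%% encode a list of pairs sorted by key, so the pair order is explicit.
sorted(Meta) when is_map(Meta) ->
    Pairs = lists:keysort(1, maps:to_list(Meta)),
    erlang:crc32(erlang:term_to_binary(Pairs)).

demo() ->
    A = [{index, 611960}, {term, 1}],
    B = [{term, 1}, {index, 611960}],
    %% Same pairs, different order: the encoded binaries differ,
    %% so any checksum computed over them is order-sensitive.
    OrderMatters = erlang:term_to_binary(A) =/= erlang:term_to_binary(B),
    %% Both lists build the same map; after key-sorting, the input
    %% to the checksum is the same regardless of insertion order.
    SameAfterSort = sorted(maps:from_list(A)) =:= sorted(maps:from_list(B)),
    {OrderMatters, SameAfterSort}.
```

After compiling the module with `c(meta_checksum).` in an Erlang shell, `meta_checksum:demo().` returns a pair of booleans. Comparing `meta_checksum:naive(Meta)` on two OTP releases with the `Meta` term from the transcripts is a quick way to check whether a given pair of releases is affected.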
https://github.com/rabbitmq/rabbitmq-server/pull/8143 makes rolling upgrades to Erlang 26 succeed under a constant load involving QQs.
Describe the bug
NOTE: RabbitMQ does not support OTP26 yet. This issue should not affect any users.
When performing a rolling upgrade from OTP25 to OTP26, the first node running OTP26 to re-join the cluster will not be able to accept Ra snapshots:
Full log file: upgrade.log.gz
Reproduction steps
(main branch or 3.11, I will use main)
(perf-test -x 10 -y 10 -r 5000 -c 500 -qp fivers-%d -qpf 1 -qpt 10 -qa x-max-length=1000000)
Expected behavior
Successful upgrade :)
Additional context
No response