rabbitmq / osiris

Log based streaming subsystem for RabbitMQ

osiris_replica_reader: Stop with `normal` if the leader is already gone during `init/1` #162

Closed dumbbell closed 4 months ago

dumbbell commented 5 months ago

Why

In the context of RabbitMQ, if a stream queue is deleted right after being declared, there is a chance that some Osiris processes might not be ready yet at the time the queue is deleted.

In particular, the osiris_replica_reader process sets up a monitor on the given leader (an osiris_writer process in the context of a RabbitMQ stream queue) during its init/1, and by then that leader process might already have stopped.

When this happens, here is the crash that is logged:

[error] <0.1548.0> ** Generic server <0.1548.0> terminating
[error] <0.1548.0> ** Last message in was {'DOWN',#Ref<0.1118981177.1281884162.97904>,process,
[error] <0.1548.0>                                <0.1535.0>,noproc}
[error] <0.1548.0> ** When Server state == {state,
[error] <0.1548.0>                          {osiris_log,
[error] <0.1548.0>                           {cfg,
[error] <0.1548.0>                            ".../__delete_queue_1716383944197847531",
[error] <0.1548.0>                            <<"__delete_queue_1716383944197847531">>,500000000,
[error] <0.1548.0>                            256000,#{},[],
[error] <0.1548.0>                            {write_concurrency,
[error] <0.1548.0>                             #Ref<0.1118981177.1282015234.97903>},
[error] <0.1548.0>                            {osiris_replica_reader,
[error] <0.1548.0>                             {resource,<<"/">>,queue,<<"delete_queue">>},
[error] <0.1548.0>                             {127,0,0,1},
[error] <0.1548.0>                             6489},
[error] <0.1548.0>                            #Fun<osiris_writer.0.78287785>,
[error] <0.1548.0>                            #Ref<0.1118981177.1282015234.97826>,16},
[error] <0.1548.0>                           {read,data,0,tcp,all,8,undefined},
[error] <0.1548.0>                           undefined,undefined,
[error] <0.1548.0>                           {file_descriptor,prim_file,
[error] <0.1548.0>                            #{handle => #Ref<0.1118981177.1282015238.91045>,
[error] <0.1548.0>                              owner => <0.1548.0>,
[error] <0.1548.0>                              r_buffer => #Ref<0.1118981177.1282015234.97902>,
[error] <0.1548.0>                              r_ahead_size => 0}}},
[error] <0.1548.0>                          <<"__delete_queue_1716383944197847531">>,tcp,
[error] <0.1548.0>                          #Port<0.84>,<33363.1916.0>,<0.1535.0>,
[error] <0.1548.0>                          #Ref<0.1118981177.1281884162.97904>,
[error] <0.1548.0>                          {write_concurrency,
[error] <0.1548.0>                           #Ref<0.1118981177.1282015234.97903>},
[error] <0.1548.0>                          {osiris_replica_reader,
[error] <0.1548.0>                           {resource,<<"/">>,queue,<<"delete_queue">>},
[error] <0.1548.0>                           {127,0,0,1},
[error] <0.1548.0>                           6489},
[error] <0.1548.0>                          -1,0}
[error] <0.1548.0> ** Reason for termination ==
[error] <0.1548.0> ** noproc

That is because the osiris_replica_reader process receives the DOWN message from the leader monitor with the noproc reason and reuses that reason as its own exit reason. Because noproc is an abnormal exit reason, a crash report is logged.
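The behaviour described above can be reproduced outside Osiris. The following is a minimal sketch, not code from the Osiris repository; the module name noproc_demo and the function run/0 are made up for illustration. It shows that monitoring a process that no longer exists delivers a DOWN message with the noproc reason right away.

```erlang
-module(noproc_demo).
-export([run/0]).

%% Hypothetical demo: monitor a process that has already exited and
%% return the reason carried by the resulting 'DOWN' message.
run() ->
    %% Spawn a process that exits immediately, then wait until it is gone.
    Pid = spawn(fun() -> ok end),
    Ref0 = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref0, process, Pid, _AnyReason} -> ok
    end,
    %% The process no longer exists, so this second monitor triggers a
    %% 'DOWN' message immediately, with reason noproc.
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Calling noproc_demo:run() should return noproc, which is the same reason that appears in the crash report above.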

How

There is no reason to log such a crash when the process tree is being shut down concurrently. In that situation, osiris_replica_reader can terminate with the normal reason.

That is what this patch does: if the reason in the leader's DOWN message is noproc, osiris_replica_reader terminates with the normal reason instead.
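The idea can be sketched with a small gen_server. This is an illustration under assumptions, not the actual patch: the module name reader_sketch, the state record, and its field names are hypothetical and do not mirror the real osiris_replica_reader state.

```erlang
-module(reader_sketch).
-behaviour(gen_server).

-export([start/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical state: only what is needed to track the leader monitor.
-record(state, {leader_pid, leader_ref}).

start(LeaderPid) ->
    gen_server:start(?MODULE, LeaderPid, []).

init(LeaderPid) ->
    %% If LeaderPid is already gone, a 'DOWN' message with reason noproc
    %% is delivered to this process right away.
    Ref = erlang:monitor(process, LeaderPid),
    {ok, #state{leader_pid = LeaderPid, leader_ref = Ref}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The leader was already gone when we monitored it: stop quietly with
%% the normal reason, so no crash report is logged.
handle_info({'DOWN', Ref, process, _Pid, noproc},
            #state{leader_ref = Ref} = State) ->
    {stop, normal, State};
%% Any other leader exit reason is reused as our own exit reason.
handle_info({'DOWN', Ref, process, _Pid, Reason},
            #state{leader_ref = Ref} = State) ->
    {stop, Reason, State};
handle_info(_Other, State) ->
    {noreply, State}.
```

Starting this server with the pid of a process that has already exited makes it stop with the normal reason, whereas without the first handle_info/2 clause it would stop with noproc and produce a crash report like the one shown earlier.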