rabbitmq / osiris

Log based streaming subsystem for RabbitMQ

Handle one more case of corrupt files at startup #142

Closed gomoripeti closed 1 year ago

gomoripeti commented 1 year ago

After RabbitMQ was shut down abruptly, the last index file did not have a corresponding segment file. In this state the stream cannot start up and crashes in a loop. (This was reproduced with osiris 1.6.2, but the code looks the same on main.)

initial call: osiris_writer:init/1
...
reason: {{case_clause,missing_file},
         [{gen_batch_server,handle_continue,2,
                            [{file,"src/gen_batch_server.erl"},
                             {line,413}]},

The missing_file is thrown from osiris_log:maybe_fix_corrupted_files/1.

In this case, just delete the index file, as no data from the missing segment file can be recovered (the same as if the segment file existed but without any chunks).
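To illustrate the idea only, here is a minimal standalone sketch of that recovery step. It is not the actual patch and does not reproduce osiris_log:maybe_fix_corrupted_files/1. The module name, the function name, and the `.index` / `.segment` naming with zero-padded offsets are assumptions made for the example.

```erlang
-module(orphan_index_sketch).
-export([drop_orphan_last_index/1]).

%% Hypothetical helper, not part of osiris: if the newest index file in Dir
%% has no matching segment file, delete it and report what was removed.
-spec drop_orphan_last_index(file:filename()) ->
    ok | {deleted, file:filename()} | {error, term()}.
drop_orphan_last_index(Dir) ->
    %% Assumption: file names are zero-padded offsets, so sorting the names
    %% lexically also sorts them by offset and the last one is the newest.
    case lists:sort(filelib:wildcard("*.index", Dir)) of
        [] ->
            %% Nothing on disk, nothing to repair.
            ok;
        IndexFiles ->
            LastIndex = lists:last(IndexFiles),
            Segment = filename:rootname(LastIndex, ".index") ++ ".segment",
            case filelib:is_regular(filename:join(Dir, Segment)) of
                true ->
                    %% The index has its segment; leave both alone.
                    ok;
                false ->
                    %% No segment means no chunk data to recover, so the
                    %% index on its own is useless and is removed.
                    IndexPath = filename:join(Dir, LastIndex),
                    case file:delete(IndexPath) of
                        ok -> {deleted, IndexPath};
                        {error, _} = Err -> Err
                    end
            end
    end.
```

Calling `orphan_index_sketch:drop_orphan_last_index(Dir)` on a directory whose last index has no segment returns `{deleted, Path}`. A second call returns `ok` as long as the remaining last index has its segment. Only the last index is checked, which matches the failure described here, so orphaned index files earlier in the directory are left for other checks to handle.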

It is possible that this is not the correct solution and the last segment file should not be missing at all.

gomoripeti commented 1 year ago

Willing to add a test case later if the patch looks ok.

michaelklishin commented 1 year ago

@gomoripeti please do, if anything, this will help us reviewers understand how the problem can be triggered.

gomoripeti commented 1 year ago

I was thinking more of a test case where I manually delete the last segment file. The issue was triggered in production by an OOM, and in the test environment by something like a load-test setup:

Will look at adding the test with manual delete next week.

michaelklishin commented 1 year ago

@gomoripeti producing an external test we can run N times should be enough. This repo does not have RabbitMQ-specific bits and tests that kill nodes repeatedly tend to have the most flakes.

gomoripeti commented 1 year ago

thanks Karl for the confirmations