Closed gomoripeti closed 1 year ago
Willing to add a test case later if the patch looks ok.
@gomoripeti please do. If anything, this will help us reviewers understand how the problem can be triggered.
I was thinking more of a test case where I manually delete the last segment file. The issue was triggered in production by an OOM, and in the test env by something more like a load-test setup:
Will look at adding the test with manual delete next week.
@gomoripeti producing an external test we can run N times should be enough. This repo does not have RabbitMQ-specific bits, and tests that kill nodes repeatedly tend to have the most flakes.
thanks Karl for the confirmations
After RabbitMQ was shut down abruptly, the last index file did not have a segment file. In this state the stream cannot start up and crashes in a cycle. (This was reproduced with osiris 1.6.2, but the code looks the same on main.)
It was found that `missing_file` is thrown from `osiris_log:maybe_fix_corrupted_files/1`.
In this case, just delete the index file, as no data from the missing segment file can be recovered (the same as if the segment file existed but without any chunks).
It is possible that this is not the correct solution and the last segment file should not be missing at all.
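To make the proposed recovery step concrete, here is a minimal sketch of the idea, written in Python for brevity. It is not the osiris code, which is Erlang. The function name `remove_orphaned_last_index` is hypothetical, and the `.index` / `.segment` extensions and the assumption that sorting file names gives segment order are illustrative assumptions about the log directory layout.

```python
from pathlib import Path


def remove_orphaned_last_index(log_dir):
    """Delete the last index file if it has no matching segment file.

    Returns the path that was deleted, or None if nothing was removed.
    Hypothetical helper: file extensions and name ordering are assumptions.
    """
    # Assumption: file names sort in the same order as the segments were written.
    index_files = sorted(Path(log_dir).glob("*.index"))
    if not index_files:
        return None

    last_index = index_files[-1]
    # Assumption: the segment file shares the index file's base name.
    segment = last_index.with_suffix(".segment")
    if segment.exists():
        # Index and segment are both present, so there is nothing to repair.
        return None

    # No data can be recovered for a missing segment, so drop its index.
    last_index.unlink()
    return last_index
```

Manually deleting the last segment file in a scratch directory and then calling this helper mirrors the reproduction discussed in the thread: the orphaned index is removed and earlier index/segment pairs are left alone.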