phaistos-networks / TANK

A very high performance distributed log service
Apache License 2.0

Corrupt Files Produce Core Dumps #65

Closed: gregschrock closed this issue 5 years ago

gregschrock commented 6 years ago

Hello,

I've been running tests that involve cutting power to our product, and therefore to TANK as well. With this approach, it has been very easy to get TANK into a non-functional state.

There are essentially three kinds of failure I've seen.

Recoverable Segment/Index Corruption

This is of course the ideal case. It can easily be addressed by following the steps described in Troubleshooting, and there are no core dumps. Not a big deal.

Zero Byte Logs

There have been a number of cases where the active segment ends up as a zero byte file (logs causing the failure attached). The failure I've seen looks as follows:

$ tank -p zero-byte-log/ -l 127.0.0.1:11011
tank: /builddir/build/BUILD/tank/service.cpp:3051: Switch::shared_refptr<topic_partition> Service::init_local_partition(uint16_t, const char*, const partition_config&): Assertion `l->lastAssignedSeqNum >= l->cur.baseSeqNum' failed.
Aborted (core dumped)
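A pre-start check along the following lines could flag this condition before the service reaches the assertion. It is only a sketch: the `.log` and `.index` suffixes and the `find_zero_byte_files` name are assumptions for illustration, not TANK's documented on-disk layout.

```python
import os
from pathlib import Path


def find_zero_byte_files(root, suffixes=(".log", ".index")):
    """Return the sorted paths under root that match a suffix and are empty.

    The default suffixes are assumptions; adjust them to the file names
    actually present in your data directory.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.suffix in suffixes and path.stat().st_size == 0:
                empty.append(path)
    return sorted(empty)
```

Calling `find_zero_byte_files("zero-byte-log/")` before launching `tank` would list the candidates to inspect; what to do with them is a separate decision.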

Non-Empty Corrupt Logs

The other failure case is one I've only seen once, and I can't say exactly what led to it. I don't believe the machine that produced these logs was unplugged as in the other cases.

This one is a corrupt segment that has data but is not recoverable through TANK_FORCE_SALVAGE_CURSEGMENT. It looks like (logs attached once again):

$ tank -p corrupt-log/ -l 127.0.0.1:11011
Failed to initialize topics and partitions:pread64() failed:Success
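The `failed:Success` suffix suggests the read did not fail with an error code at all. One plausible explanation (an assumption, not confirmed anywhere in this thread) is that `pread64()` returned fewer bytes than requested because the file is truncated, which leaves `errno` at 0, and `strerror(0)` is rendered as "Success". The sketch below shows how the two cases can be told apart; `read_exact` and `ShortReadError` are hypothetical names, not part of TANK, and `os.pread` is available on Unix only.

```python
import os


class ShortReadError(Exception):
    """Raised when a file ends before the requested range is fully read."""


def read_exact(fd, length, offset):
    """Read exactly `length` bytes starting at `offset`, or raise.

    os.pread raises OSError on a real I/O error; an empty result means
    end of file, i.e. the file is shorter than the caller expected.
    """
    start = offset
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = os.pread(fd, remaining, offset)
        if not chunk:  # end of file before `length` bytes were read
            got = length - remaining
            raise ShortReadError(
                f"expected {length} bytes at offset {start}, got {got}"
            )
        chunks.append(chunk)
        offset += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

Reporting the short read explicitly gives a message that points at truncation rather than at a read error that never happened.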

Steps to Address

The only way I've found to address the last two failures is deleting the active segment and index files for the offending topic(s).
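A slightly safer variant of this workaround is to move the offending files aside rather than delete them, so they stay available for inspection or a later salvage attempt. A minimal sketch, assuming the service is stopped first and that the quarantine directory lives outside the TANK data directory; the `quarantine` function is hypothetical:

```python
import shutil
from pathlib import Path


def quarantine(files, quarantine_dir):
    """Move each file into quarantine_dir and return the new paths.

    Nothing is deleted, and an existing file in quarantine_dir is never
    overwritten.
    """
    target = Path(quarantine_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in map(Path, files):
        dest = target / f.name
        if dest.exists():
            raise FileExistsError(f"{dest} already exists; refusing to overwrite")
        shutil.move(str(f), str(dest))
        moved.append(dest)
    return moved
```

Which files count as "the active segment and index" for a topic still has to be determined by hand; this helper only makes the removal reversible.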

If you could give me some less intrusive steps to recovery, that would be great. It would also be preferable if TANK could determine that the segment or index is not as expected and fail gracefully rather than core dumping.
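To make the "fail gracefully" request concrete, here is a language-neutral illustration of the pattern, written in Python rather than TANK's C++: check the invariant named in the assertion message, report which partition violates it, and return a non-zero exit status instead of aborting. The class and function names are hypothetical, and the field meanings are inferred only from the identifiers in the assertion above.

```python
import sys
from dataclasses import dataclass


class CorruptPartitionError(Exception):
    """The on-disk state of a partition violates an expected invariant."""


@dataclass
class PartitionState:
    topic: str
    partition: int
    base_seq_num: int           # assumed: base sequence number of the active segment
    last_assigned_seq_num: int  # assumed: last sequence number assigned so far


def validate(state):
    """Raise CorruptPartitionError instead of asserting on a bad invariant."""
    if state.last_assigned_seq_num < state.base_seq_num:
        raise CorruptPartitionError(
            f"{state.topic}/{state.partition}: last assigned sequence number "
            f"{state.last_assigned_seq_num} is below the active segment base "
            f"{state.base_seq_num}; the active segment or index may be corrupt"
        )


def init_partitions(states):
    """Validate every partition; return a process exit code (0 ok, 1 failed)."""
    failed = False
    for state in states:
        try:
            validate(state)
        except CorruptPartitionError as err:
            print(f"error: {err}", file=sys.stderr)
            failed = True
    return 1 if failed else 0
```

The point of the design is that every bad partition is named in the output before the process exits, so an operator knows which files to look at without opening a core dump.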

Thanks

Attachments: corrupt-log.tar.gz, zero-byte-log.tar.gz

markpapadakis commented 6 years ago

Thank you Greg. I am going to look into them soon, and as long as they are reproducible they will definitely be addressed.

krconv commented 5 years ago

I think I've hit this problem too; maybe we could abort instead of assert, so that TANK doesn't produce a coredump?

markpapadakis commented 5 years ago

You may want to check the new tank2 branch.

I finally got some spare time and pushed it upstream. We have had no issues with it (the client and service have been rewritten, with support for clustering, etc.; documentation is forthcoming). You may want to use the new client and service as a drop-in replacement, although you should probably try this on some private datasets first, because it hasn't really been tested by anyone other than us.

This release should fix all those issues, and then some. Feedback is most welcome :)

markpapadakis commented 5 years ago

TANK2 should take care of this issue (and other such issues).