db, shelf: allow repairing the db by dropping corrupted data

Assuming that the shelf magic is correct (we fsynced the header), the contents of the shelf might be corrupted in a variety of ways:

1. The shelf's length is invalid, as in the last item is not a whole slot.
2. The shelf's length is valid, but within a slot, the item header is invalid (it defines an item that does not fit in the slot).
3. The shelf's length and the slot item headers are correct, but the data itself is corrupted.

The 3rd case is not Billy's problem. Billy is a "dumb" data store, so if something got corrupted whilst maintaining the database schema, the outer process needs to deal with it. In Geth's case, this is handled: if a blob cannot be decoded, it will be dropped.

The first 2 cases, however, need to be handled by Billy and currently are not. In both cases, Billy will fail on startup when iterating the content. Whilst we could argue that failing and letting the outer user resolve it is not a bad thing, there's also no real way for Geth to resolve these automatically: we don't want Geth to know the internal structure of the db, and going in hot and deleting an entire folder seems like a nuclear option.

This PR instead adds a repair mode into Billy. If the database is opened in RW mode and repairing is requested, then the above two scenarios will be fixed:

1. If the shelf content is not a multiple of item sizes, the last (partial) item will be truncated (losing the data, no other way).
2. If the shelf content is a good multiple (or was already truncated), but the item header is corrupted, then it will be considered a gap during the opening "ceremony" and will be silently compacted out.

In both these scenarios, the repair is destructive. That said, for Geth's use case this is fine, and even more generally, if the database data format is borked, there's not much more we can do really. A partial data loss seems preferable to a total data loss.

holiman / billy

db, shelf: allow repairing the db by dropping corrupted data #23
