mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org
ISC License
1.85k stars 155 forks source link

irmin-pack: Export on-disk doesn't deduplicate correctly #1814

Open Ngoguey42 opened 2 years ago

Ngoguey42 commented 2 years ago

@icristescu and I have been investigating a bug found by Victor Allombert on https://gitlab.com/tezos/tezos/-/merge_requests/4862#note_908551016. He found out that the snapshot export produced larger snapshots when using --on-disk _. The produced snapshots were completely functional.

The snapshot export process makes use of a hashset for deduplication of entries. The default mode uses an in-memory implementation and the --on-disk mode uses the index library with Key as hashes and an empty Value segment. It appeared that index lost track of some entries which resulted in less deduplications (hence the larger snapshots).

This loss of index entries was caused by a very unsafe use of Bytes.unsafe_to_string : bytes -> string which made the assumption that the consumer of the string would not keep a reference to it. In our case the bytes is a read buffer and the consumer was a binary decoder.

Detailed call stack of the problem

When reading the log entries, index uses a buffer that is reused for decoding all entries (allocated here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/log_file.ml#L209)

Decoding a log entry e is done that way: Entry.decode (Bytes.unsafe_to_string scratch.buffer) (here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/log_file.ml#L76-L80)

Entry.decode uses Key.decode (here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/data.ml#L73-L76)

Key.decode is Repr.decode_bin Hash.t (here https://github.com/mirage/irmin/blob/b1c2518017be7faa0205a9238fd4085149643755/src/irmin-pack/unix/pack_index.ml#L23-L29)

val Hash.t is implemented using Repr.(string_of (`Fixed _)) that relies on Digestif.Make_BLAKE2B(_).of_raw_string (here https://github.com/mirage/irmin/blob/b1c2518017be7faa0205a9238fd4085149643755/src/irmin-tezos/schema.ml#L45)

On the repr side, the decode function for fixed size strings creates a copy of the input string in most situations except when pos = 0 && len = String.length buf where it doesn't (here https://github.com/mirage/repr/blob/401e6f3391d26bd389aeea6e6a62b31cfaaf2fa8/src/repr/binary.ml#L228)

On the digestif side, Digestif.of_raw_string is the identity function (here https://github.com/mirage/digestif/blob/a94075cfec70fbcb31e15f94e28cb715c9e4d9dd/src/digestif_conv.ml#L93-L96)

The result of all this is that e.key == Bytes.unsafe_to_string scratch.buffer.

It did not occur before because index must be asked to work with empty Values for the pos = 0 && len = String.length buf path to be taken.

Short term solutions

A copy of the input buffer must be made at some point.

Copying from the Key functor in Irmin terribly slowed down the process. The decoding of key must stay lightning fast because index relies on interpolation search find entries in the "data" file. See https://github.com/mirage/irmin/pull/1811 .

Copying from the faulty entry_of_offset function in Index didn't seem to affect the performances. See https://github.com/mirage/index/pull/387 .

Long term solutions

See #1815

tomjridge commented 2 years ago

Thanks for the write up (great!). Indeed, unsafe_to_string needs to be used with great care. Calling Entry.decode (Bytes.unsafe_to_string scratch.buffer) requires that you understand exactly how Entry.decode works (including all the dependent code that might be called), and that you guarantee this use will be correct, not only from the point you write this expression, but also through all possible evolutions of all the dependent libraries involved. I just don't see how one can reasonably give that guarantee in this case. At the very least, all uses of Bytes.unsafe_to_string should either be obviously valid (i.e. looking locally at the lines involved can guarantee the use is correct), or there should be a comment as to why the use is valid.