irmin-pack: Export on-disk doesn't deduplicate correctly

@icristescu and I have been investigating a bug found by Victor Allombert on https://gitlab.com/tezos/tezos/-/merge_requests/4862#note_908551016. He found out that the snapshot export produced larger snapshots when using --on-disk _. The produced snapshots were completely functional.

The snapshot export process makes use of a hashset for deduplication of entries. The default mode uses an in-memory implementation and the --on-disk mode uses the index library with Key as hashes and an empty Value segment. It appeared that index lost track of some entries which resulted in less deduplications (hence the larger snapshots).

This loss of index entries was caused by a very unsafe use of Bytes.unsafe_to_string : bytes -> string which made the assumption that the consumer of the string would not keep a reference to it. In our case the bytes is a read buffer and the consumer was a binary decoder.

Detailed call stack of the problem

When reading the log entries, index uses a buffer that is reused for decoding all entries (allocated here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/log_file.ml#L209)

Decoding a log entry e is done that way: Entry.decode (Bytes.unsafe_to_string scratch.buffer) (here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/log_file.ml#L76-L80)

Entry.decode uses Key.decode (here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/data.ml#L73-L76)

Key.decode is Repr.decode_bin Hash.t (here https://github.com/mirage/irmin/blob/b1c2518017be7faa0205a9238fd4085149643755/src/irmin-pack/unix/pack_index.ml#L23-L29)

val Hash.t is implemented using Repr.(string_of (`Fixed _)) that relies on Digestif.Make_BLAKE2B(_).of_raw_string (here https://github.com/mirage/irmin/blob/b1c2518017be7faa0205a9238fd4085149643755/src/irmin-tezos/schema.ml#L45)

On the repr side, the decode function for fixed size strings creates a copy of the input string in most situations except when pos = 0 && len = String.length buf where it doesn't (here https://github.com/mirage/repr/blob/401e6f3391d26bd389aeea6e6a62b31cfaaf2fa8/src/repr/binary.ml#L228)

On the digestif side, Digestif.of_raw_string is the identity function (here https://github.com/mirage/digestif/blob/a94075cfec70fbcb31e15f94e28cb715c9e4d9dd/src/digestif_conv.ml#L93-L96)

The result of all this is that e.key == Bytes.unsafe_to_string scratch.buffer.

It did not occur before because index must be asked to work with empty Values for the pos = 0 && len = String.length buf path to be taken.

Short term solutions

A copy of the input buffer must be made at some point.

Copying from the Key functor in Irmin terribly slowed down the process. The decoding of key must stay lightning fast because index relies on interpolation search find entries in the "data" file. See https://github.com/mirage/irmin/pull/1811 .

Copying from the faulty entry_of_offset function in Index didn't seem to affect the performances. See https://github.com/mirage/index/pull/387 .

Long term solutions

See #1815

mirage / irmin