Open Ngoguey42 opened 2 years ago
Thanks for the write up (great!). Indeed, unsafe_to_string
needs to be used with great care. Calling Entry.decode (Bytes.unsafe_to_string scratch.buffer)
requires that you understand exactly how Entry.decode
works (including all the dependent code that might be called), and that you guarantee this use will be correct, not only from the point you write this expression, but also through all possible evolutions of all the dependent libraries involved. I just don't see how one can reasonably give that guarantee in this case. At the very least, all uses of Bytes.unsafe_to_string
should either be obviously valid (i.e. looking locally at the lines involved can guarantee the use is correct), or there should be a comment as to why the use is valid.
@icristescu and I have been investigating a bug found by Victor Allombert on https://gitlab.com/tezos/tezos/-/merge_requests/4862#note_908551016. He found out that the snapshot export produced larger snapshots when using
--on-disk _
. The produced snapshots were completely functional.The snapshot export process makes use of a hashset for deduplication of entries. The default mode uses an in-memory implementation and the
--on-disk
mode uses the index library withKey
as hashes and an emptyValue
segment. It appeared that index lost track of some entries which resulted in less deduplications (hence the larger snapshots).This loss of index entries was caused by a very unsafe use of
Bytes.unsafe_to_string : bytes -> string
which made the assumption that the consumer of thestring
would not keep a reference to it. In our case thebytes
is a read buffer and the consumer was a binary decoder.Detailed call stack of the problem
When reading the log entries, index uses a buffer that is reused for decoding all entries (allocated here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/log_file.ml#L209)
Decoding a log entry
e
is done that way:Entry.decode (Bytes.unsafe_to_string scratch.buffer)
(here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/log_file.ml#L76-L80)Entry.decode
usesKey.decode
(here https://github.com/mirage/index/blob/4decebdff182cd101fb0246b651c260b82a951e3/src/data.ml#L73-L76)Key.decode
isRepr.decode_bin Hash.t
(here https://github.com/mirage/irmin/blob/b1c2518017be7faa0205a9238fd4085149643755/src/irmin-pack/unix/pack_index.ml#L23-L29)val Hash.t
is implemented usingRepr.(string_of (`Fixed _))
that relies onDigestif.Make_BLAKE2B(_).of_raw_string
(here https://github.com/mirage/irmin/blob/b1c2518017be7faa0205a9238fd4085149643755/src/irmin-tezos/schema.ml#L45)On the repr side, the decode function for fixed size strings creates a copy of the input string in most situations except when
pos = 0 && len = String.length buf
where it doesn't (here https://github.com/mirage/repr/blob/401e6f3391d26bd389aeea6e6a62b31cfaaf2fa8/src/repr/binary.ml#L228)On the digestif side,
Digestif.of_raw_string
is the identity function (here https://github.com/mirage/digestif/blob/a94075cfec70fbcb31e15f94e28cb715c9e4d9dd/src/digestif_conv.ml#L93-L96)The result of all this is that
e.key == Bytes.unsafe_to_string scratch.buffer
.It did not occur before because index must be asked to work with empty
Values
for thepos = 0 && len = String.length buf
path to be taken.Short term solutions
A copy of the input buffer must be made at some point.
Copying from the
Key
functor in Irmin terribly slowed down the process. The decoding of key must stay lightning fast because index relies on interpolation search find entries in the "data" file. See https://github.com/mirage/irmin/pull/1811 .Copying from the faulty
entry_of_offset
function in Index didn't seem to affect the performances. See https://github.com/mirage/index/pull/387 .Long term solutions
See #1815