lukechampine / blake3

An AVX-512 accelerated implementation of the BLAKE3 cryptographic hash function
MIT License
352 stars 25 forks source link

How to verify the digest of the whole file when digests of file parts are known #18

Open haimgel opened 1 year ago

haimgel commented 1 year ago

From my reading of Blake3 paper, and my understanding of Merkle trees in general, I think this should be doable in theory: if I have Blake3 digests of parts of a file, I should be able to calculate the digest of the whole (by building the tree of digests and then getting the root). But I don't think this module's API exposes enough to make it work, or am I missing something?

pcfreak30 commented 1 year ago

I think this is connected to how abao can verify slices of data. https://github.com/n0-computer/abao

But https://github.com/lukechampine/blake3/issues/17#issuecomment-1548498871

lukechampine commented 1 year ago

if I have Blake3 digests of parts of a file, I should be able to calculate the digest of the whole

kinda. A BLAKE3 digest is the root of a Merkle tree, but the root hash and the interior node hashes use different flags. (This was an intentional design decision.) So if you have two files A and B, there's no way to get BLAKE3(A || B) from BLAKE3(A) and BLAKE3(B).

As pcfreak said, you probably want to use Bao for this. The Bao encoding contains the interior node hashes, which lets you reconstruct the Merkle root.

haimgel commented 1 year ago

@lukechampine, thank you for your response!

Unfortunately, I cannot use Bao since Bao requires scanning the whole file in sequence, and I need to avoid it (the parts could be processed out of sequence, on different machines, etc.) I can, however, get some other aspect of an internal state of Blake3 for each part separately. I wonder if the complete blake3 digest can be constructed from just the chaining values of each part?

lukechampine commented 1 year ago

Ah. Well, as long as each part is aligned to 1024-byte boundaries of the original file, it should be possible to compute the chaining value(s) of each part and combine them into the total root. This package doesn't provide a direct way to do that, but you could achieve this by calling BaoEncode on each part and manually extracting the relevant chaining values. (This will be somewhat wasteful though, as the Bao file will contain many more chaining values than the ones you actually need.)

If I were you, I think I would fork this package and add a Stack method that returns the internal stack field. Then write a function that takes a set of stacks and computes their combined root. That way, you can calculate the chaining values for each part independently, and merge their stacks together at the end. To make your life a lot easier, ensure that the number of chunks in each part is a power of 2. That way, each part will only have a single chaining value, and you can merge them using the algorithm in mergeSubtreesGeneric.

haimgel commented 1 year ago

@lukechampine, thank you, this is very helpful! I'll try these suggestions.

haimgel commented 1 year ago

@lukechampine, I managed to get it to work, I also had to make the "counter" work right for parts other than the first one.

Would you be interested in a PR that implements a wrapper around Hasher that minimally exposes some internal state and functionality of the Hasher, similar to what the Rust crate has in their "guts" module? They do it for a very similar purpose.

lukechampine commented 1 year ago

no promises, but if you open a PR I'll definitely take a look!

lukechampine commented 3 months ago

@haimgel fyi, I've added a guts package now :)

haimgel commented 3 months ago

Thank you so much for this!