Integration with ArchiveBox and/or python in general

pirate commented 6 months ago

Howdy!

I love this tool, I've been wondering whether something like this existed for some time!

I'd love to integrate it with ArchiveBox because I think it's a perfect use-case. We have big collections of assorted internet archiving content, and it's often useful to prove a captured page (or its subresources) has existed in your collection at some point in time without revealing info about all the other sites you've archived.

ArchiveBox is very filesystem-oriented, everything it outputs is just a file on disk somewhere, and it's all under one big subdirectory, so it seems like a perfect fit. We don't integrate with IPFS/Storj yet so we haven't gotten the benefit of their baked-in hashing.

There are a lot of legal/journalism/whistleblower use-cases for internet archiving that demand some sort of chain-of-custody/integrity/history provability, and they often don't need anything fancier than a simple merkle of the content + metadata about the server/IP it was originally collected from.

I'd love to hear your thoughts/recommendations on whether this is a reasonable use case, and how to go about doing it.

A few questions:

ArchiveBox is a Python project distributed for amd64/arm64, if we can't depend on the go compiler, should we vendor or download compiled binaries of this at runtime?
When only a few files in a tree change, is it possible to compute an updated root hash without rehashing all the unchanged entries?
Are the file paths/names/permissions/modified time/created time/extended attrs/etc part of the hash, or just the file content? (we only want the content for our use case I think)
How would you recommend storing the resulting hashes and proofs so that they're easily accessible next to the original files, without becoming part of the next hash-computing run themselves? (I'm imagining a .merkdir file in each subdirectory of our file tree containing the hash of everything from that point downwards, but I'm not sure if adding those files will itself cause the root hash to change and invalidate all the previous hashes a result. Can we just ignore dot-prefixed files when hashing?)
Given a root hash, do you need access to the entire tree of original data (in the state it was in when the root hash was computed) to generate an inclusion proof for one file within?

Sorry if some of these questions don't make sense, it's been a long time since I last played with merkle trees and I forgot all my basic intuitions around them.

makew0rld commented 6 months ago

Hello, thanks for checking out merkdir! ArchiveBox is a cool project and it'd be awesome to see this integration. I think using merkdir here is reasonable, with a few caveats.

Responding to your questions:

Vendoring binaries makes sense to me.
It's not currently possible to only update a few files in a tree. I would have to think a lot about the UX of this as merkdir was mostly designed around the idea of static snapshots of directories, where not updating is a feature.
Only the file content and a random nonce are used to produce the hash for each file. No file metadata.

How would you recommend storing the resulting hashes and proofs so that they're easily accessible next to the original files, without becoming part of the next hash-computing run themselves? (I'm imagining a .merkdir file in each subdirectory of our file tree containing the hash of everything from that point downwards, but I'm not sure if adding those files will itself cause the root hash to change and invalidate all the previous hashes a result. Can we just ignore dot-prefixed files when hashing?)

Good question. First off, all you need to store is the tree file. This contains the hash, but is also needed to generate inclusion proofs in the future. Storing proof files is not unreasonable, but also not necessary since they can be regenerated exactly from the tree file. Storing hashes is unnecessary because it is already in the tree file.

Currently there is no way to ignore files, that could be a useful feature. But it's worth noting that the way merkdir is designed, the root hash will change every time you generate one even if no content has changed, because of the use of random nonces. The tool is not supposed to generate a single immutable identifier for a directory of content, but to provide Merkle tree functionality over the set of files that existed at scanning time. Like being able to sign/timestamp a hash instead of all the files, or being able to generate inclusion proofs without releasing all your files.

So with that said, even though adding your tree files into the same directory is kind of messy, it doesn't really break any guarantees you had before. Because everything you need is stored in the tree file, the directory can be modified as much as you want and you'll still be able to generate inclusion proofs and verify specific files that haven't changed.

For cleanliness personally I would store the tree files in some separate area, named after the original folder or something.

Given a root hash, do you need access to the entire tree of original data (in the state it was in when the root hash was computed) to generate an inclusion proof for one file within?

Given only a root hash, merkdir can't really do anything. However given a tree file, you don't need access to the directory it was generated from. You don't even need the file the proof is being generated for.

Here's the command:

merkdir inclusion -t documents_tree.merkdir -f "name/of/file.txt" -o my_proof.merkdir

Specifying the name of the file doesn't look it up on the disk, it's just a user-friendly way of saying "give me file number 2,184".

The output file just contains the following struct (CBOR encoded):

type InclusionProof struct {
    LeafIndex uint64   // zero-indexed leaf number, from left to right
    TreeSize  uint64   // number of leaves
    Nonce     []byte   // Nonce for proven leaf
    Proof     [][]byte // Node hashes, in bottom-to-top order
}

All the information required for this struct is already stored in the tree file, so no other files are necessary.

Of course verifying the inclusion proof requires the file it's for, where we take the file hash and the struct above, then produce a root hash that will match the original tree root hash if everything is valid.

At the beginning I mentioned some caveats about using merkdir with ArchiveBox. The main thing I'd say is what I mentioned above, that merkdir is designed for static directories. It would be possible to design a Merkle tree tool that worked like an append-only log, where changes could be incorporated incrementally, and previous root hashes would still be part of the overall tree and verifiable. Probably I would do this with a Merkle Mountain Range.

But merkdir was not designed this way for simplicity reasons, and so this makes it more cumbersome to use in scenarios where files are changing often. You have to keep track of multiple tree files (and signatures or whatnot of their root hashes), and which proofs you've generated (if any) map to which tree files. This is especially important for timestamping, where you don't want to lose timestamp proofs for your older files just because new ones have come along.

Designing an append-only log Merkle tree tool would be cool, but it's not something I plan on doing currently.

pirate commented 6 months ago

Thanks for the detailed response! I'll dig into all of this and weigh the various options. Implementing something like this is still a ways away for me but I wanted to ask my questions early to get a sense of whether it's even a good idea / what I'll need to learn about before I dive in.

It's not currently possible to only update a few files in a tree

That's actually ok for our use-case, it just means I'd store a tree for each Snapshot (which is what we call an immutable collection of archive data from a certain website), instead of for a whole collection of Snapshots (which can change as entries get added/removed).

Only the file content and a random nonce are used to produce the hash for each file. No file metadata.

Great, that's perfect for our use-case too, since I may want to rename files in future versions for better UX (without invalidating previous hashes).

the way merkdir is designed, the root hash will change every time you generate one even if no content has changed, because of the use of random nonces

I would probably have to change this behavior to not use random nonces if we used it, since I would want hashes to be deterministically computable across different machines and installs.

You're welcome to close this issue for now (or leave it open if you think other people arriving here might have similar questions). Thanks again for the suggestions and for addressing everything in such detail!

makew0rld commented 6 months ago

I would want hashes to be deterministically computable across different machines and installs

Fair enough. Maybe this is something I could add in the future if ArchiveBox needs it.

Thanks for your interest in the project! Happy to talk more about it anytime, and perhaps I'll see you at DWeb Camp :)

makew0rld / merkdir

Integration with ArchiveBox and/or python in general #3