multiformats / cid

Self-describing content-addressed identifiers for distributed systems
Other
430 stars 79 forks source link

CIDv2 idea: include the heights of trees in the CID #62

Open EdSchouten opened 1 year ago

EdSchouten commented 1 year ago

The other week I was brainstorming whether it would be possible to use IPLD as a potential data storage/transfer format for a future version of Bazel’s Remote Execution protocol, most notably its Content Addressable Storage (CAS). See https://github.com/bazelbuild/remote-apis/issues/250 for details. Where this use case differs from the IPFS is that Bazel remote execution follows a more traditional client-server model. There is no immediate intent to use peer-to-peer sharing of objects.

One thing I was thinking about, was what an efficient algorithm for replicating a DAG from a client to a server (i.e., uploading source code to build), or from a server to a client (i.e., downloading build outputs) would look like. Considering that IPLD/IPFS relies on chunking more heavily compared to what Bazel does right now, it’d be pretty important for build clients/servers to use heavy parallelism to transfer such data across.

That said, you do want to place bounds to the amount of parallelism to prevent exhaustion in case of large data sets. It’s fine for a DAG to have large fanout, and it’s fine for a DAG to be deep. But if a DAG has large fanout and is very deep, then it might be necessary to limit the amount of parallelism traversing the DAG to avoid keeping too many partially replicated blocks in memory.

One piece of information that would be very useful to have to be able to implement a properly bounded parallel replication algorithm is tree height. If this information was attached to every link, then a replication algorithm could at any point make smart choices on how aggressively fan out when traversing.

Unfortunately, this information is not part of CIDv0/CIDv1, meaning that if a storage system using IPLD wants to use such information, it would need to track it out of band. Alternative, one could use IPLD with a custom link system, but that has all sorts of unfortunate implications, such as being unable to export build results into IPFS for archiving purposes.

My question is therefore whether a future version of CID, if ever created, could include tree height (in blocks) as well.

rvagg commented 1 year ago

Have a read through #49 and you'll get a sense for the kinds of things people want out of a possible CIDv2. Particularly the last comment where @vmx has a link to a hackmd writeup of a proposal he championed. Something a lot of people want is the ability to have something arbitrarily parameterised, and in that framework a "depth" would certainly be an option. Aside from the challenges that come from getting to a CIDv2, I think your biggest problem is that lack of verification that a "depth" parameter gives. Kind of like the Tsize value in a dag-pb block, which is used for UnixFS to say how big a file/dir is before loading it; you just can't prove it and you have to trust the producer of the data if you're going to use it for anything other than a "hint". At least with a codec, you can verify it when you get the next block. You can't verify anything relating to the whole DAG without getting the whole DAG. But, if it's something you need, then maybe an intermediate block as your root with that kind of metadata in it would be better? That's pretty common, even in Git, where you have a "root" which has metadata and also a hash of the "real root of the data". You have an extra block in the step, but you can fill it up with whatever hints you need for the consumer of the data. Kind of like the HAMT spec we have, similar to other data structures, where you encode parameters in the root block but the real meat of the data structure come when you load the next link.

EdSchouten commented 1 year ago

Hey Rod,

Have a read through https://github.com/multiformats/cid/pull/49 and you'll get a sense for the kinds of things people want out of a possible CIDv2. Particularly the last comment where @vmx has a link to a hackmd writeup of a proposal he championed.

Thanks! I'll read through that.

Something a lot of people want is the ability to have something arbitrarily parameterised, and in that framework a "depth" would certainly be an option. Aside from the challenges that come from getting to a CIDv2, I think your biggest problem is that lack of verification that a "depth" parameter gives. Kind of like the Tsize value in a dag-pb block, which is used for UnixFS to say how big a file/dir is before loading it; you just can't prove it and you have to trust the producer of the data if you're going to use it for anything other than a "hint". At least with a codec, you can verify it when you get the next block. You can't verify anything relating to the whole DAG without getting the whole DAG.

So just to double check: with "depth" you mean counting the distance of a block with respect to the root? Yes, I agree that that's a property that you can only validate by traversing the entire DAG, as you can't rule out that a block isn't referenced by other parts of a DAG, which may affect the depth.

In this specific case I'm interested in tracking the "height", which is the length of the longest downward path to a leaf from a given block. It's the inverse. As far as I know, that information can be verified by only considering the contents of a single block:

$H{parent} = \max (0, H{child1} + 1, H{child2} + 1, H{child3} + 1, ...)$

If such a check is performed for every block in the DAG independently (similar to validating the hash), then you can be certain that the height of the overall DAG is correct as well.

But, if it's something you need, then maybe an intermediate block as your root with that kind of metadata in it would be better?

In my case I'm explicitly interested in having it part of the CID, so that I can come up with a schema independent way of efficiently replicating DAGs.