Open aschmahmann opened 1 year ago
From the FRC abstract:
it turns out to be much more useful to be able to express the tuple of (Root hash, size of tree) than just the root hash.
This is how I think it should be done. The size doesn't really matter for the hash. The size information is additional context that is needed for some application. To me, this is not what Multihash is for.
I've discussed with a lot of people about how to transmit additional context with CIDs at the LabWeek/IPFSCamp in 2022. I think there should be a way, but I still believe it shouldn't be part of the CID, but something separate.
This is how I think it should be done. The size doesn't really matter for the hash. The size information is additional context that is needed for some application. To me, this is not what Multihash is for.
Yes/No. The situation as it currently stands makes referring to the data needed by CID not particularly useful. It was acknowledged that the v1 Piece CIDs are not a real CIDs which is why the codecs are labeled as filecoin
rather than ipld
(https://github.com/multiformats/multicodec/pull/161#issuecomment-620995138 and https://github.com/multiformats/multicodec/pull/172#issuecomment-620392474). This seems to be because (as mentioned in those issues) it wasn't clear why/how people were going to use these CIDs or if they'd be used outside of Filecoin internals. However, it turns out people now want to move around this data by CID and the current multihash + codec combination doesn't do that.
Fixing mistakes is IMO precisely what multiformats is for and realizing that Filecoin Pieces might need real Content IDentifiers (CIDs) so they can be used in other contexts (e.g. IPFS p2p transfer of data by piece CID) and that a new multihash is necessary to do so is the cost of learning (and the benefit of having multiformats).
I talked with various people about it, I now have a better understanding.
I've only thought about the producing side of things. You put in some data and you get a specific Merkle tree out. Though if you think about CIDs also as a way to retrieve the data, then the height matters. You would retrieve data and would want to verify that the data is the one you expected. Without the height, you could get a higher level of the Merkle tree (random bytes and not the data you expected) and it would still verify correctly.
This means that it now makes sense to me to include the height.
Should other applications of merkle-tree hashes lean on custom multihashes rather than reusing existing ones?
From an IPLD perspective (i.e. if these are multihashes intended for CIDs) then I think the answer would come down to what we are addressing. The original "bmt" entry we had, which I think was even thought to be useful for bitcoin binary merkle addressing, turns out to be not so useful because you're addressing the entire base set of data, which is not really what you typically want. From the IPLD perspective you typically want to navigate through links to individual pieces of these things.
For CommP, if we want it to address the entire base data being "hashed", then viewing the merkle process as a hash function itself makes more sense.
When we start doing things like inclusion proofs then it gets a bit more iffy. Does https://github.com/filecoin-project/FIPs/blob/master/FRCs/frc-0058.md muddy these waters a bit? Does data-segment neatly fold up into this new conception of CommP? Or do we keep those concerns entirely separate from this?
Also: you're going to have to work on table formatting (see CI) .. sorry but your long name is going to make that tricky, or you might have to shorten it.
And broadly this is +1 from me, just with some general questions about data-segment, and also the need to update the table formatting.
Can be merged in draft at least? We've been using this in web3.storage for a long time now.
can be merged when it's mergeable; table formatting needs fixing up at least
This PR reserves a single multihash code. The request here comes from the Filecoin ecosystem which is already using the underlying hashes here in relation to the codes reserved in https://github.com/multiformats/multicodec/pull/170 and https://github.com/multiformats/multicodec/pull/172.
However, it turns out the approach taken in the reservation of those codes was insufficient and as a result there’s a new spec proposal for a different multihash to better represent the data being referred to.
At a high level there were three approaches for describing these trees given here. We went with option 2, but that turns out to have been a mistake so now we’re going with option 3.
Why is this a good idea when you already had a code here?
You live and you learn. It turns out that for the particular data being dealt with the approach taken originally was problematic (and is IIUC largely why the related codec entries were tagged as
filecoin
rather than IPLD) and this one makes life easier. The reason CID and multihash exist is to make it easier to evolve through mistakes, this is one example of that.Should other applications of merkle-tree hashes lean on custom multihashes rather than reusing existing ones?
As with my comment around IPLD codecs from a while ago I think the answer lies in what the application gain/loses out on by not reusing existing multihashes.
Is it reasonable to have multiple codes dealing with effectively the same data?
Situationally, yes. In this case it turns out there are better and worse ways to describe the same data and people might even disagree on what those are. The table allows us to support these options.
A related example I’ve heard mentioned in the past was if someone wanted to register codes for the internal components of the blake3 tree so that data could be referred to as either a Blake3 multihash of a 1GiB file, or a more specific Blake3 internal-node multihash for the root of the tree whose leaves are the 1GiB file. On the surface this seems like a reasonable request as well.
cc @ribasushi @willscott @rvagg @vmx
(p.s. I hope Rod and Volker forgive me for making them relive #161, #170 and #172)