multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License

Reserve multihash type for ssz-sha2-256 #266

Closed i-norden closed 2 years ago

i-norden commented 2 years ago

This introduces an entry to the registry for a multihash that represents a SSZ Merkle root using a SHA2-256 hash function. SSZ is a serialization and Merkleization process that is used extensively in the Ethereum Beacon chain.

The SSZ specification is described in detail here: https://github.com/ethereum/consensus-specs/blob/dev/ssz/simple-serialize.md#simpleserialize-ssz.

rvagg commented 2 years ago

So, you're wanting to look at ssz not as a serialization format (which I think would get us into similar problems as protobuf does) but to identify the result of a merkle root over ssz encoded data, with sha2-256 as the actual hash function? So it's kind of like a double-layer hash process (where sha2-256 is already doing its own internal form of merklization!).

It feels a bit weird; we do have bmt ("binary merkle tree") as a sort of precedent (although I'm not aware of any active use of it and am tempted to propose removing it because it stretches definitions a bit much IMO).

But, are we stuck with the schema problem though? Can we define ssz-sha2-256 as a hash function that you could run/implement that takes a byte array and produces a fixed length (32) byte array output? Or does it still need additional data to run (i.e. a schema over the data)?

I suspect the problem you're up against is that you have the root, but can't define intermediate blocks through the merkle tree to the leaves, so have no way to create linked IPLD blocks between that root and the leaves. And you can't just use sha2-256 because it's really not a sha2-256(bytes) where bytes is the entire base content, so you need to define something novel. And you're probably going to have to add yet another eth-* codec here to describe a codec that gets you from a merkle root to all the leaves in one jump.

I think that still leaves us with the problem that ssz-sha2-256 isn't something that can be used independently; it's strictly tied to the data that it's "hashing". It's kind of an eth-beacon-ssz-sha2-256, isn't it? How would you implement this function in code to plug in to go-ipld-prime such that it's usable according to the go-multihash interface?

i-norden commented 2 years ago

Thanks for the review @rvagg , as always this is really helpful and these questions really get at the heart of the problem.

> So, you're wanting to look at ssz not as a serialization format (which I think would get us into similar problems as protobuf does) but to identify the result of a merkle root over ssz encoded data, with sha2-256 as the actual hash function? So it's kind of like a double-layer hash process (where sha2-256 is already doing its own internal form of merklization!).

Precisely, although to be clear the block binary referenced by an ssz-sha2-256 multihash would be the SSZ serialization of some object. So, in order to produce the multihash (root) from the referenced binary, one would need to SSZ decode the binary and then provide the SSZ decoded object to the SSZ Merkleization algorithm: ssz_sha2_256(ssz_decode(block_binary)). I would propose that this SSZ decoding step is stipulated as part of the "hashing" algorithm specified by the ssz-sha2-256 multihash type, so in that sense it would still just be ssz_sha2_256(block_binary).

A major challenge in this discussion is that the SSZ specification is general purpose and schema-agnostic (as far as I can tell), but the existing implementations are not. The SSZ serialization and deserialization specifications provide algorithms for taking any fixed-length or variable-length, heterogeneous or homogeneous composite object and encoding it to a byte string, and vice versa. Similarly, the SSZ Merkleization specification provides an algorithm for taking any such composite object, splitting it into byte chunks, storing those chunks in the leaves of a Merkle tree, and calculating the root of that tree.
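To illustrate the bytes-in, 32-byte-root-out shape described above, here is a simplified sketch (it is not the full SSZ spec: real Merkleization packs basic values per their types, mixes in lengths for variable-size objects, and so on, and the `ssz_sha2_256` stand-in below skips the real decode step entirely):

```python
import hashlib

def merkleize(chunks: list[bytes]) -> bytes:
    """Pad the list of 32-byte chunks to a power of two with zero chunks,
    then hash sibling pairs upward to a single 32-byte root."""
    n = 1
    while n < max(len(chunks), 1):
        n *= 2
    layer = chunks + [b"\x00" * 32] * (n - len(chunks))
    while len(layer) > 1:
        layer = [hashlib.sha256(layer[i] + layer[i + 1]).digest()
                 for i in range(0, len(layer), 2)]
    return layer[0]

def ssz_sha2_256(block_binary: bytes) -> bytes:
    """Toy stand-in for the proposed multihash function. A real
    implementation would first SSZ-decode block_binary into a typed
    object and chunk it per the spec; here we just split the raw bytes
    into zero-padded 32-byte chunks to show the overall shape."""
    chunks = [block_binary[i:i + 32].ljust(32, b"\x00")
              for i in range(0, len(block_binary), 32)] or [b"\x00" * 32]
    return merkleize(chunks)
```

From a caller's point of view this is just a bytes -> 32 bytes function, which is what would let it sit behind a generic multihash interface.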

We would still need to register multicontent types for e.g. the BeaconBlock and BeaconState, but this is mostly so that we can provide additional typing to the links they contain (this would perhaps not be needed at all if every hash in these objects were an ssz-sha2-256 root, but that is not the case; e.g. we also have 32-byte keccak256 hashes that reference eth1 data structures).

Without knowing that some binary represents a BeaconBlock, the SSZ deserialization algorithm (as specified) would let us decode that binary into a heterogeneous, variable-length composite type, e.g. a []interface{} where each interface{} is concretely either a basic type or some other composite type we can recursively resolve to a set of basic types. In this way it is analogous to RLP encoding/decoding: we could RLP decode an EthHeader IPLD block into an []interface{} without any additional knowledge of its type (and this []interface{} would round-trip back to the same binary), but using the EthHeader codec we can decode it into an IPLD node object where the 32 bytes for the state trie root, receipt trie root, tx trie root, and uncle hash are replaced with CIDs that include the keccak256 multihash prefix and the appropriate content type prefixes.
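The RLP analogy can be made concrete with a minimal illustrative codec (a sketch, not the implementation from any production library): schemaless decoding yields nested lists of raw byte strings that round-trip to the identical binary, but gives no way to tell which 32-byte strings are actually hashes/links.

```python
def rlp_encode(item) -> bytes:
    """Encode bytes or (nested) lists of bytes per RLP."""
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item                      # single byte, self-encoding
        return _len_prefix(len(item), 0x80) + item
    payload = b"".join(rlp_encode(x) for x in item)
    return _len_prefix(len(payload), 0xC0) + payload

def _len_prefix(n: int, offset: int) -> bytes:
    if n < 56:
        return bytes([offset + n])
    nb = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(nb)]) + nb

def rlp_decode(data: bytes):
    """Decode RLP into nested lists of byte strings -- no type info."""
    item, rest = _decode_item(data)
    assert rest == b"", "trailing bytes"
    return item

def _decode_item(data: bytes):
    b0 = data[0]
    if b0 < 0x80:                            # single byte
        return data[:1], data[1:]
    if b0 < 0xB8:                            # short string
        n = b0 - 0x80
        return data[1:1 + n], data[1 + n:]
    if b0 < 0xC0:                            # long string
        ll = b0 - 0xB7
        n = int.from_bytes(data[1:1 + ll], "big")
        return data[1 + ll:1 + ll + n], data[1 + ll + n:]
    if b0 < 0xF8:                            # short list
        n = b0 - 0xC0
        payload, rest = data[1:1 + n], data[1 + n:]
    else:                                    # long list
        ll = b0 - 0xF7
        n = int.from_bytes(data[1:1 + ll], "big")
        payload, rest = data[1 + ll:1 + ll + n], data[1 + ll + n:]
    items = []
    while payload:
        item, payload = _decode_item(payload)
        items.append(item)
    return items, rest
```

A decoded header is just `[b"...32 bytes...", ...]`; only a type-aware codec knows which of those 32-byte fields should become CIDs.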

> It feels a bit weird; we do have bmt ("binary merkle tree") as a sort of precedent (although I'm not aware of any active use of it and am tempted to propose removing it because it stretches definitions a bit much IMO).

It is weird, and it definitely stretches the definitions, but I think it is actually better specified than bmt, which specifies neither the type of hash function used in the Merkle tree nor (to my knowledge) an algorithm for how the referenced binary is chunked across leaf nodes.

> But, are we stuck with the schema problem though? Can we define ssz-sha2-256 as a hash function that you could run/implement that takes a byte array and produces a fixed length (32) byte array output? Or does it still need additional data to run (i.e. a schema over the data)?

In practice we are, but in theory we are not 😅 In theory we could write an SSZ Merkleization function that is capable of producing a proper 32-byte root given any arbitrary byte input. This ultimately hinges on having an SSZ deserialization process that is capable of decoding that arbitrary byte input into a typed object (which can then be readily Merkleized). Such a generic deserialization algorithm is defined, but as far as I can tell no implementation exists. Edit: we aren't limited by schema; a general purpose codec and Merkleization implementation has been demonstrated, but it will need to be revived: https://github.com/prysmaticlabs/go-ssz/blob/master/ssz.go.

> I suspect the problem you're up against is that you have the root, but can't define intermediate blocks through the merkle tree to the leaves, so have no way to create linked IPLD blocks between that root and the leaves. And you can't just use sha2-256 because it's really not a sha2-256(bytes) where bytes is the entire base content, so you need to define something novel. And you're probably going to have to add yet another eth-* codec here to describe a codec that gets you from a merkle root to all the leaves in one jump.

I think the problem is more similar to the one we run into with the Block PartSetTree in Tendermint, wherein, due to how values are chunked across leaves in the tree, there is no way to make sense of individual leaf values. We could IPLDize every node in the SSZ Merkle tree (although there would be significant practical/performance limitations and complications to this), and we would be able to navigate from the root down to a leaf, but there's no guarantee that once we get to that leaf we can make sense of the value stored in it. It's possible the bytes stored in that leaf don't map to a value at all, but are instead only a segment of the bytes that encode a value, with the rest of the segments spread across other leaves.
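A toy illustration of the leaf problem, using a hypothetical 48-byte value and 32-byte chunks (the SSZ chunk size):

```python
# A 48-byte value spans two 32-byte chunks: the second leaf holds only
# a fragment of the value plus zero padding, so inspected in isolation
# it cannot be interpreted as any value at all.
value = bytes(range(48))
chunks = [value[i:i + 32].ljust(32, b"\x00") for i in range(0, len(value), 32)]

assert len(chunks) == 2
assert chunks[1][:16] == value[32:]    # trailing fragment of the value
assert chunks[1][16:] == b"\x00" * 16  # zero padding, not data
```

Navigating an IPLDized tree down to `chunks[1]` gives you a leaf whose bytes are meaningless without its sibling and without knowledge of the original value's boundaries.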

> I think that still leaves us with the problem that ssz-sha2-256 isn't something that can be used independently, it's strictly tied to the data that it's "hashing"; it's kind of a eth-beacon-ssz-sha2-256 isn't it? How would you be implementing this function in code to plug in to go-ipld-prime such that it's usable according to the go-multihash interface?

I think we can; the specification as I read it (the SSZ spec has been a major pain point in the coordination/alignment of all the beacon chain teams...) suggests this is entirely possible. But I would like to take some more time to explore and discuss this, and ideally I would throw together a POC, but I'm not sure what the timeline will look like for that. Edit: It can be used independently; an example implementation is https://github.com/prysmaticlabs/go-ssz/blob/master/ssz.go.

This question is sort of getting ahead of things, but if we can come to a consensus on this and get ssz-sha2-256 accepted into the registry, does the specific byte prefix reserved here work, or would you recommend reserving in some other range? I ask because we would like to start using this multihash byte prefix in our blockstore. If this multihash never gets upstreamed at all, that is fine and we will change it/find a workaround, but if it does get upstreamed it would be nice to know now that we wouldn't need to change it.

i-norden commented 2 years ago

@rvagg there actually is/was a general purpose codec and Merkleization Go implementation: https://github.com/prysmaticlabs/go-ssz/blob/master/ssz.go. It was archived in favor of https://github.com/ferranbt/fastssz which trades the generality for performance.

i-norden commented 2 years ago

I went ahead and also added a serialization type for SSZ, as it is possible to have an SSZ-serialized IPLD block that is hashed and referenced by some mechanism other than the SSZ Merkleization process.

rvagg commented 2 years ago

Yeah, so this is stretching things, but I think maybe it's stretching in a good way that's allowing for some experimentation around the edges of what multiformats does, which we kind of need because we keep on bumping into problems—many of which are clear according to our current understanding of what multiformats is, but some of which, like this, are in a grey area.

On balance, I think that while this is grey, it's not too much of a departure from a standard understanding of multiformats, and we're in a high enough range here that I think we can probably just add these and let you start experimenting and report back how you go with codecs and hashers and how that all fits into the various systems—some of your description makes me wonder how hard it's going to be to integrate into go-ipld-prime's LinkSystem, but also that you've probably already thought through that. If you can't make it fit, then that's a strong suggestion that maybe this isn't a good fit for the table in its current form. But if you can, and it's not just a nasty nasty hack, then maybe that's OK :shrug:.

We should certainly come back to this at some point in the near future when we're drafting up some better docs for multiformats, your input given your experience with these edges would be valuable.

> We would still need to register multicontent types for e.g. the BeaconBlock and BeaconState, but this is mostly so that we can provide additional typing to the links they contain

This is my biggest concern though - we spend a lot of time telling people not to use codecs as type signalling mechanisms. It's another very grey area though because sometimes, as you've outlined, you have no choice because you're navigating through a structure and you maybe could decode something in a generic way but it really should be specific, and you have no other means of identifying that. bitcoin-witness-commitment is such a case for me, it's just [Bytes,Bytes] which could be decoded as a mid-merkle tree bitcoin-tx but it's really something like [&BitcoinTx, Nonce], entirely different, just the same byte layout. But raw is the other obvious example - everything can be raw until you step up a type level, although we tend to treat that one as a special case.

So wrt that problem, I'd just ask you to be conscious of this and as much as possible avoid using codec codes as easy context signalling mechanisms. The pure form of multiformats is that codec code should just tell you the bit of code you reach for to turn the bytes into data model form and back again, and not much (anything?) else. The places where this is unavoidable are instructive for us about the limitations of multiformats though, so again this is good learning and you should be a good contributor to further doc & spec development here based on that experience.

Lastly, if you find yourself later on of not needing the multihash entry here because experience has taken you on another path, it would probably be best to come back and remove it to clean up. (Speaking of which, I think I'm going to finally propose removing bmt).

i-norden commented 2 years ago

> This is my biggest concern though - we spend a lot of time telling people not to use codecs as type signalling mechanisms. It's another very grey area though because sometimes, as you've outlined, you have no choice because you're navigating through a structure and you maybe could decode something in a generic way but it really should be specific, and you have no other means of identifying that. bitcoin-witness-commitment is such a case for me, it's just [Bytes,Bytes] which could be decoded as a mid-merkle tree bitcoin-tx but it's really something like [&BitcoinTx, Nonce], entirely different, just the same byte layout. But raw is the other obvious example - everything can be raw until you step up a type level, although we tend to treat that one as a special case.

Thanks Rod, will definitely try to be conscious of this. I still make the mistake of thinking of an IPLD multicodec as being mapped to a specific type, but I think I can blame this on working on IPLD pretty much solely in the context of Ethereum and Tendermint/Cosmos, where there is no way (as far as I can see) of properly decoding their IPLD block binary into IPLD models without mapping each distinct node type in their DAGs to a distinct IPLD schema type using an IPLD multicodec. This is because the necessary typing information isn't embedded in the binary, as opposed to e.g. dag-cbor, where full CIDs (rather than raw hashes) are embedded and flagged in the binary.

If we were to choose to embed this information into the binary when loading it into the blockstore, we would change the binary, changing the content hash => content mappings such that the content hashes don't match those found in the native DAG (unless we do something like what we propose for the TypedProtobuf scheme, where we implement multihash functions that can ignore "type decorations" embedded in the referenced binary).

I need to refresh my understanding of the bindnode pkg capabilities, but I think even when using it this problem persists. It means we no longer need to implement a custom codec (well, I think it is technically still a custom codec even if all it does is wrap an SSZ deserialization step followed by a call to Wrap with a specific schema type?), but we still need to map to a specific schema type. For the BeaconChain data structures, for example, this new ssz-sha2-256-bmt multihash alone tells us how to decode the referenced binary into a []interface{} with concrete types underpinning each interface{} (in some cases recursively), and this can be round-tripped back to the block binary without issue. We should then be able to use bindnode to convert that []interface{} into an IPLD model; without a schema type it should be able to create a rudimentary tuple-representation IPLD model (but not a map representation, because it has no way of inferring map keys without struct field names or a schema type). But without a corresponding schema type, it has no way of determining with certainty which interface{}s correspond to Links, let alone what they link to. It could make some assumptions (like "any interface{} that asserts to a 32-length []byte is a Link"), but that's problematic because not all 32-length []bytes will be Links and not all Links will be 32-length []bytes. If we provide an IPLD schema type to bindnode along with the []interface{}, it should be able to figure out which fields correspond to Links (and even the type of those links, and what schema types to use to represent the linked objects, if we used SpawnLinkReferences when creating the schema type).

I hope I am not too far off the mark with the above 😅 would be great to schedule a time to talk about these things if you have the availability. I was about to open PRs for the BeaconChain and Tendermint/Cosmos multicodecs, but your comment is giving me some pause.

rvagg commented 2 years ago

Yeah, that sounds about right, it all boils down to the schema - and the same logic applies to codegen too, the structure is in the schema.

And it's a very similar issue to what we have with the Filecoin chain. There are plenty of complex structs in there, but there are no maps at all in the chain; they're all tuples, so a base data model view is just arrays of junk (although we do at least get CIDs since it's dag-cbor) and really not that interesting to look at. Only by applying a schema can you see useful data in there.

And yes, I have a lot of sympathy for the case where typing information is absent and you can't infer by any other means. It's really not how CIDs are supposed to be used, but I think this is an edge where we don't have good answers. This gets really hard when you want to be able to consider blocks as independent units (I use explore.ipld.io as an example of this—if I put a CID in and get a block, what's it going to make of it without any additional information?). But of course most blocks exist within a system - so a filecoin block in the IPLD explorer looks like garbage, but within the context of filecoin I can make sense of it—but even there, I have to have additional context of where the block is in relation to other blocks to make sense of it (I can't just say "this is a Filecoin block, so I can read useful stuff out of it"). This is the same pattern we see repeated - you have context of "I am in system X", and then you layer additional context on the path to reading a particular block, which builds up over the navigation path so you can say "I am in system X, and I'm reading block Y that's in position Z, so the data I'm seeing is useful".

This "context" think is an interesting topic I keep on meaning to write about because it's so important, but hard to grasp or explain well.

Good luck though and I look forward to seeing the fruits of your further delving into these codecs.