spacemeshos / SMIPS

Spacemesh Improvement Proposals
https://spacemesh.io
Creative Commons Zero v1.0 Universal

Version information metadata #75

Open lrettig opened 2 years ago

lrettig commented 2 years ago

Overview

It's desirable that we be able to attach version information metadata to certain data structures (such as blocks, transactions, ballots, proposals, and eligibility proofs), explicitly or implicitly (via, e.g., including a version ID in the preimage of a hash that's signed). This metadata will allow nodes to verify whether or not a given data structure is valid in a given context, and it will provide information about which version of the protocol other block producers are running.

Scope

The scope of this SMIP includes the design of the versioning system: how versions should be set and interpreted. It also includes how version information should be encoded into data structures. Versioning of other, independent protocols (P2P, Hare, beacons, etc.) is explicitly out of scope.

Goals and motivation

  1. Establishing a terminology and a common, global way to refer to different networks/chains, while preventing version conflicts across chains/forks
  2. Make it as easy as possible for downstream tooling (e.g., block explorers, deployment tools, wallets) to specify and connect to multiple Spacemesh-compatible networks/chains, and to differentiate data structures (e.g., blocks) from these different chains
  3. Cross-chain replay protection; clean network bifurcation (in case of a contentious hard fork)
  4. Make sure that transactions (targeting an older protocol/VM) aren't included in blocks where they would be invalid, or would behave otherwise than anticipated by the user or application that created them
  5. Have an on-chain record of which protocol version each block producer is running, and a way to zero the voting weight of those running old versions of the protocol (as an incentive to upgrade)

High-level design

We introduce a hierarchical three-part versioning system, akin to semantic versioning (semver) but with some important distinctions. Rather than adopting semver's MAJOR.MINOR.PATCH, we adopt the following convention: GENESISID.VMID.PROTOCOLID.

Prior art

Ethereum

Ethereum introduced a CHAIN_ID to transaction hashing and signing in EIP-155 for the purpose of cross-chain replay protection. The ID is an integer that is manually configured per network. A list of IDs is published at https://chainlist.org/ and maintained at https://github.com/ethereum-lists/chains/tree/master/_data/chains. We are not aware of versioning of any other data structures in Ethereum, including transactions, blocks, or smart contracts.

Algorand

Algorand implements a genesis hash which is explicitly included in every transaction.

Spacemesh

Proposed design

Definition

Each element of the version triad is a 20-byte value defined as follows:

Construction

Here's a proposed scheme for calculating these values for a live network:

Implementation plan

Version values may be implicitly or explicitly included in data structures (see notes above). When a value is included implicitly, the object carries no explicit GENESISID or VMID field. Rather, the value is included via hash-and-sign: the version blob is first prepended to the data structure to be signed, then the resulting hash is signed and the signature is appended to the data structure (i.e., the ID is included in the hash preimage). This can be accomplished in a straightforward fashion by wrapping all hash function calls.
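A minimal sketch of such a hash wrapper follows. It assumes SHA-256 as the hash function and takes the signing primitive as a caller-supplied `sign` callable; the names `versioned_digest` and `sign_versioned` are hypothetical and do not come from the proposal.

```python
import hashlib
from typing import Callable


def versioned_digest(version_blob: bytes, payload: bytes) -> bytes:
    """Hash the payload with the version blob prepended to the preimage.

    SHA-256 is an assumption made for this sketch only.
    """
    return hashlib.sha256(version_blob + payload).digest()


def sign_versioned(version_blob: bytes, payload: bytes,
                   sign: Callable[[bytes], bytes]) -> bytes:
    """Return payload || signature.

    The version blob only enters the hash preimage; it is never serialized
    into the object itself.
    """
    return payload + sign(versioned_digest(version_blob, payload))
```

Because the blob is part of the preimage, the same payload hashed under two different version blobs yields two different digests, so a signature produced for one network or version should not verify under another.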

In addition to changing how the node calculates object hashes, in order to interpret these hash ID values, nodes will require two additional pieces of infrastructure:

  1. New hash IDs need to be gossiped and shared via P2P messages. A node needs to be able to request the set of all known IDs, or recent IDs, from peers, in order to be able to verify signatures.
  2. A node should maintain a table of known IDs. This will allow it to verify signatures and check whether peers and block producers are running the latest version it's aware of. Note that this table can also be used to compress data on disk: explicit 20-byte IDs (in ATXs) can be replaced with shorter pointers to entries in this table.
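The table of known IDs described in the list above could look roughly like the following. This is a sketch only: the class name `KnownIDTable`, the append-only layout, and the 2-byte pointer width are all assumptions.

```python
class KnownIDTable:
    """Append-only table of known 20-byte version IDs (hypothetical layout)."""

    ID_LEN = 20

    def __init__(self):
        self._by_index = []   # table index -> 20-byte ID
        self._by_id = {}      # 20-byte ID -> table index

    def add(self, version_id: bytes) -> int:
        """Record an ID (idempotent) and return its table index."""
        if len(version_id) != self.ID_LEN:
            raise ValueError("expected a 20-byte ID")
        if version_id not in self._by_id:
            self._by_id[version_id] = len(self._by_index)
            self._by_index.append(version_id)
        return self._by_id[version_id]

    def knows(self, version_id: bytes) -> bool:
        return version_id in self._by_id

    def compress(self, version_id: bytes) -> bytes:
        """Replace a 20-byte ID with a 2-byte pointer for on-disk storage."""
        return self._by_id[version_id].to_bytes(2, "big")

    def expand(self, pointer: bytes) -> bytes:
        """Recover the full 20-byte ID from a 2-byte pointer."""
        return self._by_index[int.from_bytes(pointer, "big")]
```

A 2-byte pointer saves 18 bytes per stored ID, at the cost of making on-disk data meaningful only together with the local table.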

Questions

Dependencies and interactions

Stakeholders and reviewers

Testing and performance

While performance implications are expected to be minor, benchmarking of the proposed implementation should be performed to ensure that prepending version information to serialized data structures before hashing does not materially impact performance.

A thorough suite of unit tests should be specified to ensure that messages containing valid versions are accepted, and those containing invalid versions or no version are dropped (as are the peers that gossiped them).
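As an illustration of the kind of test meant here, the following sketch checks the three cases named above against a toy handler. The handler `handle_message`, the peer names, and the explicit `version` field are assumptions for the sake of the example. It also does not model the special handling of ATXs discussed later in this thread.

```python
from typing import Optional, Set


def handle_message(peer_id: str, version: Optional[bytes],
                   known_versions: Set[bytes], dropped_peers: Set[str]) -> bool:
    """Accept a message only if it carries a known version.

    Otherwise the message is rejected and its sender is recorded as dropped.
    """
    if version is None or version not in known_versions:
        dropped_peers.add(peer_id)
        return False
    return True


def run_checks() -> None:
    known = {b"\x01" * 20}
    dropped: Set[str] = set()
    assert handle_message("peer-a", b"\x01" * 20, known, dropped)       # valid version
    assert not handle_message("peer-b", b"\x02" * 20, known, dropped)   # unknown version
    assert not handle_message("peer-c", None, known, dropped)           # no version
    assert dropped == {"peer-b", "peer-c"}


run_checks()
```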

avive commented 2 years ago

It does not explain how version information should be encoded into data structures.

Why not also address this in a section of this smip once the how is figured out?

avive commented 2 years ago

Regarding genesis - what we want to avoid is data from another network which pretends to be a different network, for example another mainnet that is using the same genesis time but has a different genesis net config. To clarify, the genesis net config should include all genesis accounts and network params - all the immutable params used by a node. Therefore, it is better to hash the whole genesis net config data and not just the genesis start time, as the goal is to basically exclude any fork which changes some of network config but keeps the network id and genesis time. For example, say I fork a mainnet, change any genesis config param such as genesis accounts, and launch it. Blocks and messages from nodes on the fork would be accepted by nodes in the original mainnet. Genesis time is just one of these params and is insufficient to mitigate situations or attacks such as this.

The performance penalty should be negligible, as each node only has to hash the genesis data once, on its first session. It can then keep this hash locally, use it to check the genesis param in blocks, txs and messages, and add it to any block, tx or message it creates and shares.
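A rough sketch of this idea, assuming the genesis config is available as a JSON-serializable mapping: the config is serialized canonically, hashed with SHA-256, and truncated to 20 bytes. The field names, the placeholder values, and the function name `genesis_id` are all hypothetical.

```python
import hashlib
import json


def genesis_id(config: dict) -> bytes:
    """Hash the whole genesis config, not just the genesis time.

    Sorting keys gives a canonical encoding, so two nodes holding the same
    config derive the same ID regardless of key order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).digest()[:20]


# Placeholder configs: same genesis time, different genesis accounts.
mainnet = {"genesis_time": "2000-01-01T00:00:00Z", "accounts": {"alice": 100}}
fork = {"genesis_time": "2000-01-01T00:00:00Z", "accounts": {"mallory": 100}}

MAINNET_ID = genesis_id(mainnet)   # computed once at startup, then reused
assert genesis_id(fork) != MAINNET_ID
```

With a hash of the genesis time alone, the two configs above would share an ID.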

avive commented 2 years ago

Regarding the p2p protocol versioning section - why not use exactly the same versioning scheme proposed here? It seems satisfactory and sufficient. A p2p protocol change can just cause an update to the 'protocol' field. In other words, the p2p protocol is part of the overall protocol and does not need to be identified as just a p2p protocol. If we want to support an update to p2p protocol then maybe it is better to add a p2p protocol part to the version so it has 4 parts and so we only have one concept of 'version' instead of two?

lrettig commented 2 years ago

It does not explain how version information should be encoded into data structures.

Why not also address this in a section of this smip once the how is figured out?

This is a lie, actually I did end up addressing this. Will update.

lrettig commented 2 years ago

it is better to hash the whole genesis net config data and not just the genesis start time, as the goal is to basically exclude any fork which changes some of network config but keeps the network id and genesis time

I disagree. Using a hash of genesis time alone is sufficient. The goal here is to prevent conflict, by default, in the case of distinct networks. There is effectively zero likelihood of two networks coincidentally having exactly the same genesis time.

Network A could "fork" (i.e., copy) network B's genesis time in order to have an identical message/hash format, but that's a form of attack, and adding more params to the hash couldn't prevent this attack. An attacker could always just modify the hash function.

lrettig commented 2 years ago

Regarding the p2p protocol versioning section - why not use exactly the same versioning scheme proposed here?

Because the p2p protocol and the protocol used to generate the canonical mesh are two totally different protocols. They needn't be coupled. In theory, you could use a totally different p2p protocol to construct the same mesh, or use the same p2p protocol to construct a totally different mesh (by, e.g., changing the consensus rules). If the p2p protocol changes, this should have no direct impact on the mesh protocol: it doesn't render blocks, transactions, hare messages, etc. obsolete. (In practice, a client running version n+1 of the p2p protocol could negotiate to communicate with an older client using version n of the p2p protocol--and from the perspective of the mesh, nothing would change.) And the opposite is also true: if we change the mesh protocol this doesn't necessarily mean we need to make any changes to the p2p protocol.

To put things in the context of Ethereum, it currently runs using devp2p but could also run over libp2p (as, indeed, eth2 does!).

avive commented 2 years ago

I disagree. Using a hash of genesis time alone is sufficient. The goal here is to prevent conflict, by default, in the case of distinct networks. There is effectively zero likelihood of two networks coincidentally having exactly the same genesis time.

I disagree with your disagreement or I guess we agree to disagree on this point :-)

I believe that the goal of this hash should be defined as follows: any two honest nodes that use the same hash are assured that they both use exactly the same network and genesis configs. This should be the goal of this hash - to give these honest nodes this guarantee. If we follow your logic then we don't need the hash at all, as it can be faked by dishonest nodes. It makes no sense to define this hash as the network config that honest nodes must agree on and then exclude most config params from it.

Another way to think about it - I think it is important to address the case where another network decides to run with a config that has exactly the same genesis time as the first network, but different genesis params such as accounts. I.e., we need to do what we can to protect the first network against blocks, messages and txs from the other network and vice versa. The more we do to protect these networks from each other the better, especially since this hash only needs to be computed once per node instance.

So the likelihood of two networks having exactly the same genesis time is not zero, as you argue above, because the second network may have just copied the first network's config file and changed a few things in it, and we want to protect the first network from the second one as much as we reasonably can.

Another way to think about this - I don't think that the goal of this hash is just protection from incidental use. We want to provide strong guarantees wherever we can that intentional dishonest behavior of bad actors is explicit in code changes. If you operate a rogue network and you just copy the genesis hash of an honest first network and use a different network config file, then we can't do much about bad messages, blocks and txs on the first network that originated on the second network, besides relying on the honest majority of nodes in the first network.

lrettig commented 2 years ago

This proposal has been updated to factor in the most recent R&D call on this topic. It's still open for comment. CC @noamnelke @tal-m @iddo333

lrettig commented 2 years ago

This proposal has been updated per today's R&D conversation, is complete, and is still open for comment.

noamnelke commented 2 years ago

New hash IDs need to be gossiped and shared via P2P messages. A node needs to be able to request the set of all known IDs, or recent IDs, from peers, in order to be able to verify signatures.

I remember we discussed this, but don't remember why... To be able to process a message with a certain version, the node must possess the code for that version, meaning that it's aware, at the hard-coded code level, of all previous versions. Why would we need to be able to validate the signature/hash of messages that we can't process? I remember Tal saying something about wanting to gossip those messages if we determine that the majority supports their version - but this doesn't make sense to me. We don't blindly gossip messages without validating them and we want to blacklist neighbors that send us invalid messages, so there's some accountability.

What if we want the protocol to support a range of versions, e.g., "the past 5 VM versions" in transactions, rather than just the current one? Can this be supported using hash-and-sign? A: not supported in hash-based scheme, and not needed.

Even though the ID itself is a hash, there's still a concept of "X previous versions" since each hash is a hash of the previous version concatenated with the new extradata. "Not needed" is a different story - we don't need to implement this right away, but remembering/documenting that it's possible, if needed in the future, is important IMO.
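To illustrate, here is a sketch of how a node that knows the lineage of IDs could check "within the past X versions". It assumes the ID is available explicitly, that IDs are truncated SHA-256 hashes, and that the node can rebuild the chain from extradata it already has in code. All function names are hypothetical.

```python
import hashlib


def next_id(prev_id: bytes, extradata: bytes) -> bytes:
    """Derive a new ID from the previous ID concatenated with new extradata."""
    return hashlib.sha256(prev_id + extradata).digest()[:20]


def build_chain(root_id: bytes, extradata_list) -> list:
    """Return [root, v1, v2, ...], each ID derived from its predecessor."""
    chain = [root_id]
    for extra in extradata_list:
        chain.append(next_id(chain[-1], extra))
    return chain


def within_last_n(candidate: bytes, chain: list, n: int) -> bool:
    """True if candidate is the current ID or one of the n versions before it."""
    return candidate in chain[-(n + 1):]
```

For example, with a chain built from three upgrades, `within_last_n(chain[-2], chain, 1)` holds while `within_last_n(chain[0], chain, 2)` does not.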

Unclear what happens if an ATX has a bad PROTOCOLID version.

Saying "unclear" sounds like we can find the answer to this and just haven't done this yet. I would say something like: "This value is for future use - to give us flexibility when implementing upgrades - so, by definition, we don't know if we'll need it or what for. For this reason, it's impossible to say what the required behavior will be."

at least 20 bytes, no more than 32

When I hear this, I actually hear "20 bytes" - so let's just put that in the SMIP.

What's missing for me

We don't have to solve this for this SMIP, but I think it should at least appear as an open question for future research: How do we prevent re-use of Smesher IDs across networks that share a genesis? We want to allow a smesher to choose between two forks without allowing them to use their proof-of-space data for both forks at the same time. The solution must allow smeshers that choose either side of the fork to do no additional initialization.

lrettig commented 2 years ago

How do we prevent re-use of Smesher IDs across networks that share a genesis? We want to allow a smesher to choose between two forks without allowing them to use their proof-of-space data for both forks at the same time. The solution must allow smeshers that choose either side of the fork to do no additional initialization.

The only solution I can think of is fraud proofs: if the same smesher submits an ATX on two forks in the same epoch, the ATX from one fork could be submitted to the other fork as a fraud proof that would invalidate the smesher's ID on that fork. Of course, the two forks would have to have different VMIDs and/or PROTOCOLIDs, and each fork would have to know the other's IDs to verify the fraud proofs.
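Roughly, such a check might look like the sketch below. The `ATX` fields shown are a simplification, and `verify` stands in for whatever signature validation each fork uses. Both are assumptions for illustration, not the actual ATX format.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass(frozen=True)
class ATX:
    smesher_id: bytes
    epoch: int
    protocol_id: bytes   # identifies the fork the ATX was published on
    signature: bytes


def is_fraud_proof(local: ATX, foreign: ATX, foreign_ids: Set[bytes],
                   verify: Callable[[ATX], bool]) -> bool:
    """True if the same smesher published validly signed ATXs on two forks
    in the same epoch. `foreign_ids` holds the other fork's known IDs."""
    return (
        local.smesher_id == foreign.smesher_id
        and local.epoch == foreign.epoch
        and local.protocol_id != foreign.protocol_id
        and foreign.protocol_id in foreign_ids
        and verify(local)
        and verify(foreign)
    )
```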

lrettig commented 2 years ago

New hash IDs need to be gossiped and shared via P2P messages. A node needs to be able to request the set of all known IDs, or recent IDs, from peers, in order to be able to verify signatures.

I remember we discussed this, but don't remember why... To be able to process a message with a certain version, the node must possess the code for that version, meaning that it's aware, at the hard-coded code level, of all previous versions. Why would we need to be able to validate the signature/hash of messages that we can't process? I remember Tal saying something about wanting to gossip those messages if we determine that the majority supports their version - but this doesn't make sense to me. We don't blindly gossip messages without validating them and we want to blacklist neighbors that send us invalid messages, so there's some accountability.

Transactions with the wrong VMID may be discarded, not gossiped, etc. (and we may even be able to ban peers that send them to us), but ATXs with the wrong VMID need to be retained and their signatures still need to be validated. It's true that a node would know "at the code level" about older IDs, but it would not know about newer ones. That's why I suggested we'll need to gossip them, and to store them in memory in a table.

dshulyak commented 2 years ago

Why does the node need to get new IDs from the network, and what will it do with them?

I remember that Tal was advocating for including some ID in the ATX, purely for telemetry. That would make sense, but gossiping IDs separately just doesn't. We cannot accept arbitrary data from the network; data needs to be verifiably useful, otherwise this will be just another DDoS vector.

noamnelke commented 2 years ago

@lrettig

The only solution I can think of is fraud proofs

I think there's a way to solve some ickiness around fraud proofs if we do it in advance: we introduce a RetireID message that a smesher can sign to make their ID invalid starting at some epoch. The forked network can require smeshers converting from the old network to publish this message (referencing the old network) to be eligible on the forked one (new smeshers on the forked network can include a reference to the fork in their PoST data).

This is the same principle as a fraud proof (it invalidates the smesher) but it's tied to an epoch. With a fraud proof, there's a question about when the fraud occurred and when the ID stops being valid.
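A sketch of what this could look like as data, with signature checking omitted. The field names and the helper `is_active` are hypothetical; only the message name RetireID comes from the suggestion above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RetireID:
    smesher_id: bytes
    network_id: bytes   # the network on which the ID is being retired
    from_epoch: int     # first epoch in which the ID is no longer valid


def is_active(smesher_id: bytes, epoch: int,
              retirements: Dict[bytes, RetireID]) -> bool:
    """An ID is active until the epoch named in its RetireID message, if any."""
    msg = retirements.get(smesher_id)
    return msg is None or epoch < msg.from_epoch
```

Tying the message to `from_epoch` gives a well-defined cutoff: the ID is valid for every epoch before it and invalid from that epoch on.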

ATXs with the wrong VMID need to be retained and their signatures still need to be validated

ATXs explicitly include the PROTOCOLID, so validating them doesn't require any previous knowledge, even for versions the node doesn't know about (assuming that ATXs with an unknown PROTOCOLID should be considered valid).

lrettig commented 2 years ago

ATXs explicitly include the PROTOCOLID, so validating them doesn't require any previous knowledge, even for versions the node doesn't know about

Good point. A node receiving an ATX with a PROTOCOLID it doesn't recognize wouldn't know whether that's because the PROTOCOLID is too new (i.e., from a later app update that the node hasn't installed yet), or invalid, or some other reason, but it wouldn't matter as long as they retain all ATXs.

lrettig commented 2 years ago

@dshulyak @noamnelke how do you guys feel about this part?

New hash IDs need to be gossiped and shared via P2P messages. A node needs to be able to request the set of all known IDs, or recent IDs, from peers, in order to be able to verify signatures.

Would you prefer that ATXs include them? The current design is that the ATX just includes the PROTOCOLID which is:

PROTOCOLID := hash(VMID || PROTOCOLID_prev || protocol-extradata)

If we don't want to gossip/request the VMID separately, it would have to be separately and explicitly included in the ATX.
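For concreteness, the formula above could be computed as in the following sketch. It assumes SHA-256 truncated to 20 bytes and an all-zero predecessor for the very first PROTOCOLID; neither choice is specified in this thread, and the VMID shown is a placeholder.

```python
import hashlib

ID_LEN = 20


def protocol_id(vm_id: bytes, prev_protocol_id: bytes, extradata: bytes) -> bytes:
    """PROTOCOLID := hash(VMID || PROTOCOLID_prev || protocol-extradata)."""
    if len(vm_id) != ID_LEN or len(prev_protocol_id) != ID_LEN:
        raise ValueError("VMID and PROTOCOLID_prev must be 20 bytes")
    return hashlib.sha256(vm_id + prev_protocol_id + extradata).digest()[:ID_LEN]


vm_id = bytes([7]) * ID_LEN                      # placeholder VMID
p1 = protocol_id(vm_id, bytes(ID_LEN), b"v1")    # assumed all-zero predecessor
p2 = protocol_id(vm_id, p1, b"v2")               # next ID chains onto p1
```

Since the hash is one-way, a node holding only `p2` cannot recover `vm_id` from it, which is why the VMID would have to be either gossiped or carried explicitly in the ATX.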

noamnelke commented 2 years ago

Posting the feedback I sent to @lrettig here:

I think this is more complex than it needs to be and wish we could avoid this. Specifically, it seems like all of this could be implemented the first time we upgrade something rather than in advance. This will both allow us to defer some complexity and, far more importantly, will allow us to pick the best and simplest solution, given the actual upgrade(s) we do (rather than trying to predict what we’ll need/do).

noamnelke commented 2 years ago

Moving the conversation here. @iddo333 pointed out the following as reason to implement this pre-genesis:

In theory, it’s debatable whether the signaling is needed at genesis (because we don’t have any update to perform yet, and an update implementation+voting are inseparable/atomic), but both Tal and I think that it’s required at genesis, for different reasons. Tal thinks that it’s needed so that honest miners will shut down (so that they won’t be used by an adversary somehow) if an update (that they didn’t vote for) succeeded. The other reason for signaling at genesis is that it’s crucial that the signaling would already be operational and verified on testnet, otherwise it’d be shaky to implement it when actually needed.

But I disagree. If we need old miners to stop mining on the new chain, a better way to do this is make their mining invalid (by making a small change to the ATX format). This does not rely on anything they do locally.

We're also NOT going to implement an upgrade voting mechanism before genesis. For the first year at least we're going to rely on social consensus for updates (in less fancy terms: the Spacemesh company will issue an update and ask the community to adopt it).

lrettig commented 2 years ago

Agree with @noamnelke that we won't have an upgrade voting mechanism, or signaling, before genesis. These can be added later as needed. Regarding whether we need any of this at genesis, even if we decide to hold off, we still need to figure out what to do with our hash functions. From https://github.com/spacemeshos/pm/issues/117:

@noamnelke do you think we can get away with just saying that the initial "version tuple" included in the hash function is nil?

Also, realistically we will probably need to upgrade things shortly after genesis, and we want to avoid a situation where such an upgrade would be delayed because the code wasn't ready, or tested. So this still feels high priority, even if it isn't strictly genesis critical.

noamnelke commented 2 years ago

do you think we can get away with just saying that the initial "version tuple" included in the hash function is nil?

Yes.

Also, realistically we will probably need to upgrade things shortly after genesis, and we want to avoid a situation where such an upgrade would be delayed because the code wasn't ready, or tested. So this still feels high priority, even if it isn't strictly genesis critical.

Yeah, we'll definitely need to upgrade the node shortly after genesis and we'll most likely need to upgrade the protocol (though hopefully not that quickly). Will such an upgrade require having a version encoded into messages? Even if we need to change messages, I'm not convinced that the version needs to be embedded in them.

I strongly prefer to not add this mechanism (ever) unless it's strictly needed. We'll know best if it's avoidable when we have a specific upgrade in mind. If we end up needing to add this mechanism we will be much better positioned to implement it in a minimal way and test the specific case that we end up needing, rather than implement something generic now that would have to work in a variety of use-cases and then think of all the possible cases that need to be tested.

lrettig commented 2 years ago

Reviewing the motivation from above:

Establishing a terminology and a common, global way to refer to different networks/chains, while preventing version conflicts across chains/forks

I think we do need a way to make sure a node that synced, say, a testnet, doesn't connect to and try to sync mainnet. This suggests to me that, at genesis, at the very least, we do need some sort of chainID or genesisID. (Alternatively, we could try to address this at the P2P layer, but I think that would be a mistake, it could be spoofed, etc.)

Make it as easy as possible for downstream tooling (e.g., block explorers, deployment tools, wallets) to specify and connect to multiple Spacemesh-compatible networks/chains, and to differentiate data structures (e.g., blocks) from these different chains

Same--we expect our tooling will interface with both testnet and mainnet at genesis.

Cross-chain replay protection; clean network bifurcation (in case of a contentious hard fork)

Should not be an issue at genesis, but we will need to add this to prevent certain classes of attack.

Make sure that transactions (targeting an older protocol/VM) aren't included in blocks where they would be invalid, or would behave otherwise than anticipated by the user or application that created them

Definitely not needed at genesis, and not until we perform a VM upgrade with breaking changes (e.g., changing gas prices). Even then this feels like a nice to have, as an old tx landing in a block post-hard fork is a corner case. This is mostly a UX nicety.

Have an on-chain record of which protocol version each block producer is running, and a way to zero the voting weight of those running old versions of the protocol (as an incentive to upgrade)

Definitely not needed at genesis, as discussed. We can add it later when we need it.

lrettig commented 2 years ago

Per my previous comment here, I think we'll need GENESISID by genesis to support multiple networks, including testnets, prevent data contamination/replay attacks, etc. We could leave out VMID and PROTOCOLID for now, and add them later, or we could implement them alongside GENESISID and set them to null for now. I'm not sure it's that much easier to implement just GENESISID than it would be to implement the full scheme now. Most of the epic laid out in https://github.com/spacemeshos/pm/issues/117 still applies.

lrettig commented 2 years ago

Notes from today's chat with @noamnelke: