spacemeshos / SMIPS

Spacemesh Improvement Proposals
https://spacemesh.io
Creative Commons Zero v1.0 Universal

Updates Part I: node auto-update #32

Open lrettig opened 3 years ago

lrettig commented 3 years ago

Requirements

Non-requirements

Design

Tasks

tal-m commented 3 years ago

I would separate this into two SMIPs, since there are actually two independent features here. The first is dealing with automatic node updates, and the second with protocol updates.

Node Update

The node auto-update procedure is agnostic to the update contents --- it consists of what you called "Phase I", but doesn't involve beacons or anything protocol-related. This is the mechanism that will also be used for emergency updates. The node auto-update should have a minimal grace period (during which the node operator can decide to veto an update) even for emergency updates. The reason is that veto power is intended, among other things, to mitigate an attack by an adversary with access to the update signing key. In this case, the adversary could always claim the update is an emergency update, if that sidesteps the grace period.

I think it makes sense to have a much shorter grace period during the initial phase of mainnet (when emergency updates are much more likely), but in version 1.0 it should be long enough to allow for human response to an attack. (After 1.0, if there's a need for a faster emergency update, we'll have to rely on getting the word out to enough node operators who will override the grace period manually).
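For illustration, here is a minimal sketch of how a grace-period-plus-veto gate could look in the node. The type and field names are hypothetical assumptions, not existing go-spacemesh code:

```go
// Sketch of a grace-period gate for auto-updates (illustrative only; the
// type and field names here are hypothetical, not part of go-spacemesh).
package update

import (
	"errors"
	"time"
)

// PendingUpdate describes an update announcement the node has verified
// but not yet applied.
type PendingUpdate struct {
	Version    string
	ReceivedAt time.Time
	Vetoed     bool // set when the operator explicitly rejects the update
}

// minGracePeriod is enforced even for "emergency" updates, so that a
// compromised signing key cannot bypass operator review by flagging
// every release as an emergency. The value is a placeholder.
const minGracePeriod = 24 * time.Hour

var (
	errTooEarly = errors.New("grace period has not elapsed")
	errVetoed   = errors.New("update vetoed by operator")
)

// CanApply reports whether the update may be installed at time now.
func (u *PendingUpdate) CanApply(now time.Time) error {
	if u.Vetoed {
		return errVetoed
	}
	if now.Sub(u.ReceivedAt) < minGracePeriod {
		return errTooEarly
	}
	return nil
}
```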

Protocol Update

The protocol update procedure describes the rules for when (and if) to switch to a new protocol version, based on the consensus about intent to update. This mechanism doesn't really care how the node is updated; it's relevant even if there are no automatic node updates at all. Essentially, the code for executing an updated protocol will always include the code for executing the current version of the protocol, and a decision mechanism that determines when to switch versions. (This would be the case even if the decision mechanism does not check on-mesh consensus --- e.g., "always switch to the new protocol version at layer 5000"). In terms of content, I think your "Phase II" sounds good.
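To illustrate the point that the updated binary always carries both rule sets plus a decision mechanism, here is a minimal sketch in Go. The types and the fixed activation layer are illustrative assumptions, not actual go-spacemesh code:

```go
// Sketch of the "decision mechanism": the binary carries both protocol
// versions and picks one per layer. Names are illustrative.
package protocol

type LayerID uint32

// Rules is the set of consensus rules for one protocol version.
type Rules interface {
	ApplyLayer(layer LayerID) error
}

// Switcher wraps the old and new rules and an activation height that the
// network has agreed on (whether via on-mesh voting or a hard-coded
// "always switch at layer 5000").
type Switcher struct {
	Old, New        Rules
	ActivationLayer LayerID
}

// RulesFor returns the rules that apply to a given layer.
func (s *Switcher) RulesFor(layer LayerID) Rules {
	if layer >= s.ActivationLayer {
		return s.New
	}
	return s.Old
}
```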

lrettig commented 3 years ago

Hi @tal-m, thanks for the feedback.

I would separate this into two SMIPs, since there are actually two independent features here. The first is dealing with automatic node updates, and the second with protocol updates.

I generally agree with this. I think we can and probably should separate them as you propose. As written they are not completely independent since, e.g., I included the "protocol version signature" message in "Phase I", which is the vote that "Phase II" relies on.

Also agree with your points on grace period and veto power.

lrettig commented 3 years ago

@tal-m @avive @iddo333 I broke this out into three separate proposals: this one, for the node, #33, for the protocol (per Tal's suggestion), and #34, for the app.

y0sher commented 3 years ago

Looks good generally; some comments:

lrettig commented 3 years ago

The --testnet flag could probably ease things for some users, but essentially it is a hard-coded config file, right?

I thought about it more like a shortcut to passing a bunch of other CLI flags (e.g., --network 123 --auto-updates ...) but I suppose you could think of it that way too. We'd have to work out precedence between it and the config file.
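As an illustration of the "shortcut" reading, here is a hedged sketch of how --testnet could expand into other settings while still losing to explicitly passed flags. The flag names and defaults are hypothetical, and config-file precedence is not shown:

```go
// Sketch: --testnet as a macro that fills in defaults for flags the user
// did not pass explicitly. Flag names and values are hypothetical.
package main

import (
	"flag"
	"fmt"
)

type config struct {
	NetworkID   int
	AutoUpdates bool
}

func main() {
	testnet := flag.Bool("testnet", false, "use testnet defaults")
	network := flag.Int("network", 1, "network ID")
	autoUpdates := flag.Bool("auto-updates", false, "enable auto-updates")
	flag.Parse()

	cfg := config{NetworkID: *network, AutoUpdates: *autoUpdates}
	if *testnet {
		// Apply testnet defaults only where the user did not pass an
		// explicit flag, so explicit flags still win.
		set := map[string]bool{}
		flag.Visit(func(f *flag.Flag) { set[f.Name] = true })
		if !set["network"] {
			cfg.NetworkID = 123
		}
		if !set["auto-updates"] {
			cfg.AutoUpdates = true
		}
	}
	fmt.Printf("%+v\n", cfg)
}
```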

About versioning for the client, core protocol, and P2P protocol: in the past I thought this was necessary, but now I don't see a reason not to version the whole thing as one piece.

Well, they're just sort of fundamentally different things. E.g., no P2P protocol negotiation should be necessary between two nodes if one of them installs a new version of go-spacemesh that doesn't touch the P2P code.

Also, you mentioned something about downloading the update from a peer. This feels weird to me when we get informed about the new version from a centralized source. Alternatively, we could write a decentralized protocol to negotiate versions with all your peers, decide which is the latest valid version, and then download it from peers.

This is tricky; I think we should not try to build this for 0.2. Downloading over HTTP is much easier.

If we assume that mostly tech-savvy users will use the go-spacemesh client without Smapp, maybe we can leave out auto-update and count on them to update manually, opting in by doing so?

This isn't a bad idea. Especially if we expect that most users will be running smapp early on, and that smapp will be able to handle auto-updating go-spacemesh.

noamnelke commented 3 years ago

Most of these comments are minor, but I think the last point is very important.

Adding a --testnet flag

I don't think this is needed at all. There should be a distinct config file for each network anyway and this config file can enable auto-updates for the testnet. Opting out would be done by overriding the setting from the config file using a command line arg (or simply changing the config file).

Versioning

I don't think that every little thing should have its own version, but specifically the p2p should have a version which is negotiated as part of the handshake. Clients should each send the highest version they support; if the other party's version is too low (we dropped support for that version) the connection is terminated, and otherwise the lower of the two values should be used. This allows gradually updating the p2p protocol in a heterogeneous network, especially for smaller changes, like encoding optimizations.
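A small sketch of this negotiation rule, with illustrative version constants (not the actual go-spacemesh handshake code):

```go
// Sketch of handshake version negotiation: drop peers below our minimum,
// otherwise speak the lower of the two advertised maximums.
package p2p

import "errors"

const (
	minSupportedVersion uint32 = 2 // oldest version we still speak
	maxSupportedVersion uint32 = 5 // highest version we implement
)

var errIncompatiblePeer = errors.New("peer p2p version no longer supported")

// negotiate returns the protocol version to use with a peer that
// advertised peerMax as its highest supported version.
func negotiate(peerMax uint32) (uint32, error) {
	if peerMax < minSupportedVersion {
		return 0, errIncompatiblePeer
	}
	if peerMax < maxSupportedVersion {
		return peerMax, nil
	}
	return maxSupportedVersion, nil
}
```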

I actually don't think we should version the protocol at all. We should have a bit array in ATXs where miners can indicate support for specific upgrades to the protocol. I'm totally fine with making this a single byte since I can't imagine having more than 8 proposals up concurrently, but if someone thinks we should support 16 concurrent proposals then that's fine too. These bits will always be zero, unless a proposal is up for a vote, in which case we select a bit and use that for voting.

This enables a use-case where we have a proposal up for voting that would take effect in 2 months, and a month after voting started we want to propose another upgrade that would take effect a month later.

The problem with rolling both proposals into one number (or other value) is that a miner whose node has code support for the first proposal, but not the second, wouldn't know about the second proposal and wouldn't know that miners who vote for it are also voting for the first, making it impossible for that miner to correctly tell whether they should activate the first proposal or not.
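For illustration, a sketch of what such signal bits could look like; the type name and the idea of carrying it as a single byte per ATX follow the comment above, but nothing here is an actual ATX field:

```go
// Sketch of per-ATX signal bits: one bit per concurrently pending
// proposal, all zero when nothing is up for a vote.
package signal

// SignalBits is a single byte carried in each ATX; bit i indicates the
// miner's support for pending proposal i.
type SignalBits uint8

// Supports reports whether the miner signalled support for proposal i.
func (b SignalBits) Supports(i uint) bool {
	return i < 8 && b&(1<<i) != 0
}

// WithSupport returns a copy of b with the bit for proposal i set.
func (b SignalBits) WithSupport(i uint) SignalBits {
	if i >= 8 {
		return b
	}
	return b | (1 << i)
}
```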

End of life

I'm against this proposal. A node should be able to operate indefinitely, unless something happens that prevents it.

While I don't like the idea of ever retiring a node proactively, I can live with what @tal-m suggested about nodes shutting down when they detect that the network accepted an upgrade that they don't have code support for.

But just dying of old age - what's the benefit in that?

Auto-update wait period

As discussed on the call, the minimal waiting period should be hard coded in the node and not shorter than 24 hours, imo.

I don't believe we should have a lower minimum for the testnet. The stakes are much lower in the testnet - no real money is at stake, only some hassle and perhaps reputation. So the tools available to us in mainnet should suffice.

Downloading code via P2P

I don't think this has any advantage. If we monitor updates via a URL we can also download via a URL.

If we feel that a URL is too centralized (I think that for the purpose of auto-updates it isn't) then let's start with announcing new versions over gossip instead of using a URL. Then we can include a bittorrent client library (like this one) and include trackers for the binaries of all supported platforms with the version published via gossip. This makes much more sense than implementing our own file transfer protocol. BUT I THINK WE SHOULDN'T DO EVEN THIS. URLs are great! CDNs for the binaries are awesome. Let's not fix what ain't broken.

Timing of upgrade

We should be careful with timing the actual upgrade of the node. If we push an update and exactly 24h later 2/3 of the nodes restart - this will kill the network.

At a minimum, miners should restart at a random time slot within a ~6 hour window, to give the first wave of upgraders a chance to sync and start mining again before starting another wave.

The predictability of the upgrade time is a serious security issue, IMO. During the upgrade window attackers know in advance that some nodes will miss their blocks and Hare messages and can take advantage. This is even more true when the upgrade is due to some disclosed security issue that's being fixed...

A more advanced (and safer) version is for miners to intelligently select the best time for them to upgrade, when they aren't eligible for any blocks or Hare participation (or at least are eligible for as few blocks and Hare votes as possible).

lrettig commented 3 years ago

The problem with rolling both proposals into one number (or other value) is that a miner whose node has code support for the first proposal, but not the second, wouldn't know about the second proposal and wouldn't know that miners who vote for it are also voting for the first, making it impossible for that miner to correctly tell whether they should activate the first proposal or not.

I'm having some trouble understanding this. I guess I don't think about protocol updates as discrete "proposals" that can be adopted or not adopted independent of one another. In many cases, one proposal will depend on another and there may be a complex web of interdependencies. That's why I think it's better to think of a particular instantiation of the protocol as monolithic, and give it a unique, meaningless ID, like a hash or something.

But just dying of old age - what's the benefit in that?

It makes upgrades much easier. We know that, by a known point in time, all nodes running a particular, old version will have reached their "end-of-support" date and shut down (unless the user modified the source code and recompiled). Zcash has been using this to great success, see:

the minimal waiting period should be hard coded in the node and not shorter than 24 hours, imo.

We can debate the exact right number but my gut tells me around 72 hours. 24 feels too short because someone could, e.g., be on a long flight (or, you know, a meditation retreat ;) for that long and "not get the memo."

I don't believe we should have a lower minimum for the testnet

I agree strongly with the case you made on the call, @noamnelke: we should strive to operate the testnet with identical parameters to mainnet wherever possible.

URLs are great! CDNs for the binaries are awesome. Let's not fix what ain't broken.

Agree. If you're explicitly trusting a particular developer to notify you of updates, and to provide you with signed updates (see #36), that's already "centralized." You can always choose to "track another updates channel" (to borrow Linux terminology).

Timing of upgrade

Very good point. We could maybe key this on one of the existing beacons to do it securely and in an unpredictable fashion. As long as the upgrade happens before the protocol activation layer height, which of course all nodes do need to agree on!

brusherru commented 2 years ago

I'm a bit confused by the large number of issues for things that are fairly tightly coupled. I'd prefer to make a decision about the overall strategy and user workflows first, and then decompose it into separate SMIPs / issues and dive into details. Anyway...

I've posted some of my thoughts (mainly related to Smapp, but not only) here: https://github.com/spacemeshos/SMIPS/issues/34#issuecomment-1016130304 Please check it out.

In the rest, I have some thoughts specifically about updating the Node itself.

Centralization

I think this is one of the most important things that determines how we deliver updates. Since we have signed apps, we already have centralization: only we can introduce a new version. So in this case I don't see a reason to gossip about updates through p2p. That said, I think the idea of decentralizing this is a very good one: we would no longer have to worry about what happens to the network if something happens to us, or if someone blocks our domain, etc. But then we face a much more difficult set of problems: how to deliver updates, how to trust them, and how to avoid vulnerabilities. I don't think these difficulties are the highest priority for now, so we can use a centralized source of updates. But I think we should definitely have a CLI flag to set a custom trusted source of updates.

Gradual update of nodes

If nodes notify the network about the version they are running, I think we can try to make the update gradual by using the highest byte of sha256(NodeID + UpdateHash) to determine each node's "place in the queue". For example, use the highest byte to determine how long to wait from the moment the update is available: 0 for "update immediately" and 255 for "update when almost everyone else has updated", or group it by ranges (e.g., 0-15 first in the queue, 16-32 second, and so on). This could tell the node not just "wait N hours before updating", but to watch the network for some percentage of updated nodes. For example, if my node is second in the queue, it would wait until it sees some percentage (e.g., 10%) of the nodes on the network updated, then update and tell the network that it is updated too. But there might be a vulnerability: an attacker might build their own node that does not tell the network it has updated. If there is a big enough group of such malicious nodes in the network, it will block all other nodes from updating. Also, I'm not sure how this would work for non-backward-compatible updates.

End-of-support

@lrettig can you sum up what the benefits of such a solution are? I hope that our network will grow and develop, but I can imagine that at some point in the future we may not have an actual update of the node, but we will still need to release a new version. I'd prefer to mark older nodes as outdated not just by elapsed time, but by the existence of a number of newer versions (and probably the percentage of versions in use on the network). For example, when

Grace period

I think this is a good idea. But:

  • What if we face a critical error on the network and we need to fix it asap?
  • Should we wait the same 24-72 hours as for other updates? Until then our network will be down and the price of SMH will fall for three days in a row?

While we're the centralized source of updates, we can just check the "patch" part of the semver and install such updates much faster than others. Have a 24-72 hour grace period for minor updates, and some kind of "wait until layer NNN000" (I mean the next layer number that ends with some zeroes) for major ones.
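As an illustration only, here is a sketch of picking the grace period from the kind of version bump, using golang.org/x/mod/semver. The durations are placeholders rather than agreed values, and the activation-layer handling for major releases is not shown:

```go
// Sketch: choose a grace period based on whether the update is a patch,
// minor, or major version bump. Durations are illustrative placeholders.
package update

import (
	"time"

	"golang.org/x/mod/semver"
)

// gracePeriod returns how long to wait before auto-installing an update
// from current to next (both in "vX.Y.Z" form).
func gracePeriod(current, next string) time.Duration {
	switch {
	case semver.MajorMinor(current) == semver.MajorMinor(next):
		// Patch release: install quickly.
		return 4 * time.Hour
	case semver.Major(current) == semver.Major(next):
		// Minor release: regular grace period.
		return 48 * time.Hour
	default:
		// Major release: in practice, wait for an agreed activation
		// point (e.g. the "next layer ending in zeroes" suggested above).
		return 7 * 24 * time.Hour
	}
}
```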

Auto-update flag and changing the mind

First of all, since the recommended way is to have auto-updates turned on, I propose naming the flag --no-auto-updates, so that passing it turns them off. Secondly, what if we're running the node and then decide to switch auto-updating on or off? If we only have a CLI flag for it, we have to restart the node. But it's desirable not to shut a node down at all, so maybe we need another way to handle it, for example via the API. But that raises other questions, too. I wrote more about this in the comment mentioned above, in #34.

lrettig commented 2 years ago

I think this is one of the most important things that determines how we deliver updates. Since we have signed apps, we already have centralization: only we can introduce a new version. So in this case I don't see a reason to gossip about updates through p2p.

It's not true that "only we can introduce a new version." We need to allow people to fork our code and offer competing versions. There's nothing enshrined or special about the software released by the Spacemesh team, other than the fact that we're releasing the first version.

But I don't think that these difficulties have the highest priority, for now, so we can use a centralized source of updates. But I think that we definitely should have a CLI flag to set a custom trusted source of the updates.

I agree.

If nodes notify the network about the version they are running, I think we can try to make the update gradual by using the highest byte of sha256(NodeID + UpdateHash) to determine each node's "place in the queue". But there might be a vulnerability: an attacker might build their own node that does not tell the network it has updated. If there is a big enough group of such malicious nodes in the network, it will block all other nodes from updating.

This is a clever idea! I like the idea that not all nodes auto-update at the same time. I think we can work around the vulnerability you describe by using the highest byte to pick an update time, and remove the notion of a queue or of checking how many other nodes have already updated. E.g., cause nodes to auto-upgrade over a period of 24 hrs, and the exact time they perform the update within that window depends on their ID.
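A sketch of this variant, with hypothetical names: the first byte of sha256(NodeID || UpdateHash) picks a slot inside a 24-hour rollout window, with no dependence on how many other nodes have already updated:

```go
// Sketch of de-synchronized auto-update timing: hash the node ID with the
// update hash and use the first byte to pick a slot in the rollout window.
package update

import (
	"crypto/sha256"
	"time"
)

// updateWindow is the period over which the network rolls out an update.
const updateWindow = 24 * time.Hour

// scheduledAt returns when this node should apply the update, given the
// start of the rollout window.
func scheduledAt(nodeID, updateHash []byte, windowStart time.Time) time.Time {
	h := sha256.Sum256(append(append([]byte{}, nodeID...), updateHash...))
	slot := (updateWindow / 256) * time.Duration(h[0])
	return windowStart.Add(slot)
}
```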

End-of-support

@lrettig can you sum up what the benefits of such a solution are?

I answered this above:

It makes upgrades much easier. We know that, by a known point in time, all nodes running a particular, old version will have reached their "end-of-support" date and shut down (unless the user modified the source code and recompiled). Zcash has been using this to great success, see:

Grace period

I think this is a good idea. But:

  • What if we face a critical error on the network and we need to fix it asap?
  • Should we wait the same 24-72 hours as for other updates? Until then our network will be down and the price of SMH will fall for three days in a row?

All of this only applies to auto-updates. Node operators always have the option of manually installing updates without waiting for an auto-update or a grace period. In practice, in case of a critical error, we'd need to communicate directly with node operators and ask them to update immediately.

First of all, since the recommended way is to turn on auto-updates, I propose to name such flag as --no-auto-updates and it will turn it off.

Defaults are important, and I think the default should always be not to auto-update (for governance reasons). We can recommend that users enable auto-update but I think it should be explicit opt-in.

Secondly, what if we're running the node and then decide to switch auto-updating on or off? If we only have a CLI flag for it, we have to restart the node. But it's desirable not to shut a node down at all, so maybe we need another way to handle it, for example via the API.

Agree, we can add this to the API, it should be pretty straightforward.