oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Possible race between HardwareManager and bootstore initialization #3815

Open jgallagher opened 1 year ago

jgallagher commented 1 year ago

When BootstrapAgent configures and starts the bootstore, it calls a couple of helper functions that themselves call StorageResources::all_m2_mountpoints(). Those helpers fail if no M.2 mount points are returned, but they only require that at least one is present.

Immediately prior to configuring and starting the bootstore, BootstrapAgent waits for the storage manager to know about the M.2 we booted from. However, we do not (and cannot in the general case, I think?) wait for the other M.2 to show up, since we don't even know whether it's present or functional. If the other M.2 is physically present but the storage manager doesn't find out about it until just after BootstrapAgent configures the bootstore, the bootstore will proceed using only the boot M.2; there's currently no way to come back later and add/reconcile the other M.2.

This isn't specific to the bootstore - it's just the first thing that tries to access the M.2s after BootstrapAgent has found the boot M.2. Reading from the Ledger (immediately after the bootstore is started) also uses all_m2_mountpoints, so it is presumably subject to the same issue.
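
To make the window concrete, here's a rough sketch of the shape of the check involved. Only StorageResources::all_m2_mountpoints() is a real name; the struct contents, the signature, and the bootstore_storage_paths helper below are hypothetical stand-ins:

```rust
use std::path::PathBuf;

// Stand-in for the real StorageResources; everything except the method
// name all_m2_mountpoints() is illustrative.
struct StorageResources {
    known_m2s: Vec<PathBuf>,
}

impl StorageResources {
    /// Mountpoints of every M.2 the storage manager knows about *right now*.
    /// Early in boot this may be just the boot M.2.
    fn all_m2_mountpoints(&self) -> Vec<PathBuf> {
        self.known_m2s.clone()
    }
}

/// Shape of the helpers described above: they only guard against *zero*
/// mountpoints, so if the second M.2 is discovered a moment later, the
/// bootstore has already been configured with a single path and never
/// learns about the other device.
fn bootstore_storage_paths(resources: &StorageResources) -> Result<Vec<PathBuf>, String> {
    let paths = resources.all_m2_mountpoints();
    if paths.is_empty() {
        return Err("no M.2 mountpoints found".to_string());
    }
    Ok(paths) // may contain one path even though two M.2s are installed
}
```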

andrewjstone commented 1 year ago

This is a general problem with replicated storage that consists of only two replicas. The later M.2 could come online with different data on it at the same version. With three M.2s we could only write new data, and bump the version locally, once the write succeeded on at least two M.2s. Then, if only two come up on the next boot, we read from both and write the later of the two versions back to both.
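
A rough sketch of that 3-replica policy (not current omicron code; the types and helpers below are made up for illustration):

```rust
#[derive(Clone)]
struct Ledger {
    version: u64,
    data: Vec<u8>,
}

/// Attempt to persist `new` to each M.2 slot; only treat the write (and the
/// local version bump baked into `new`) as committed if a majority (2 of 3)
/// of the devices accepted it.
fn write_with_majority(
    replicas: &mut [Option<Ledger>],
    new: Ledger,
) -> Result<(), &'static str> {
    let mut successes = 0;
    for slot in replicas.iter_mut() {
        // In the real system this would be an fsync'd write to one M.2;
        // here every physically present device simply accepts the write.
        if let Some(ledger) = slot {
            *ledger = new.clone();
            successes += 1;
        }
    }
    if successes >= 2 {
        Ok(())
    } else {
        Err("fewer than 2 of 3 replicas acknowledged the write")
    }
}

/// On boot, if the surviving replicas disagree, the newest version wins and
/// is written back to every replica that is present.
fn reconcile_on_boot(replicas: &mut [Option<Ledger>]) -> Option<Ledger> {
    let newest = replicas.iter().flatten().max_by_key(|l| l.version).cloned()?;
    for ledger in replicas.iter_mut().flatten() {
        *ledger = newest.clone();
    }
    Some(newest)
}
```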

However, this is only a problem when trying to bump versions locally (from a single sled-agent), since consensus is impossible here with only two M.2s. It is solvable in our system by ensuring that data is only updated, and versions only bumped, on Nexus, and then using an RPW to keep the M.2s consistent even if they alternate boots. There's an open issue for this.
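
For concreteness, a minimal sketch of that Nexus-driven pattern, assuming a background task that pushes the current version to each sled until everyone has persisted it (the trait, method names, and types are assumptions, not existing omicron APIs):

```rust
#[derive(Clone, Copy, PartialEq)]
struct Version(u64);

/// Minimal view of a sled from Nexus's perspective.
trait SledClient {
    /// Which version has this sled durably written to all of its M.2s?
    fn persisted_version(&self) -> Version;
    /// Push the current config; the sled writes it to every M.2 it knows about.
    fn push_config(&self, v: Version) -> Result<(), ()>;
}

/// One pass of the reconciliation task. It is idempotent and safe to re-run
/// on every activation, which is what makes alternating-M.2 boots harmless:
/// whichever device comes up eventually gets pushed the current version.
fn reconcile_pass(sleds: &[&dyn SledClient], current: Version) {
    for sled in sleds {
        if sled.persisted_version() != current {
            // Errors are ignored; the next activation just retries.
            let _ = sled.push_config(current);
        }
    }
}
```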

Since, in our current deployments, there is no actual way to update the bootstore outside of the safely replicated EarlyNetworkConfig (which will only get bumped from Nexus via an RPW), this is generally safe for now as long as both initial M.2s get the initial configuration. There's no way to overwrite it, so they'll just keep reading the only possible version regardless of whether one or two nodes boot up. If, for some reason, one node doesn't have the latest EarlyNetworkConfig, that node will learn it when it connects to other nodes and write the updated value.
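
A minimal sketch of that "learn it from peers" behavior, assuming EarlyNetworkConfig carries a monotonically increasing generation (the field name and merge function here are illustrative, not the actual bootstore code):

```rust
/// Illustrative stand-in; the real EarlyNetworkConfig lives in omicron and
/// the `generation` field here is an assumption about its shape.
struct EarlyNetworkConfig {
    generation: u64,
    // ... rack network parameters elided ...
}

/// When a peer's copy arrives, keep whichever generation is newer; the real
/// system would then persist the winner back to the local M.2(s).
fn merge_from_peer(
    local: &mut Option<EarlyNetworkConfig>,
    from_peer: EarlyNetworkConfig,
) -> bool {
    let is_newer = match local {
        Some(current) => from_peer.generation > current.generation,
        None => true,
    };
    if is_newer {
        *local = Some(from_peer);
        // persist_to_all_m2_mountpoints(local) would run here.
    }
    is_newer
}
```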

This could become a problem for LRTQ when a sled is added to a cluster (once that is implemented), since key share handout is only tracked on individual sleds. A sled could theoretically lose track of a share it had already given out to a learner if only one M.2 saved that information, the other M.2 later booted without it, and the original never came back. However, this would be harmless in the common case, as only one of two things could happen (a small sketch of case 1 follows the list):

  1. The same share gets handed out to a different learner - It will just take an extra share to be gathered for the rack to unlock (K+1 instead of K) in the case that both learners with the same share respond to the same LoadRackSecretRequest within the first K(+1) replies. If this happened to N-K sleds, then we could lose all LRTQ redundancy.
  2. No new learner comes online and asks the specific problematic sled for a share and so nobody notices at all.
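
To illustrate case 1 with a toy calculation (share identity is reduced to an integer id here; real LRTQ shares are Shamir secret-sharing points, and the function below is purely illustrative):

```rust
use std::collections::HashSet;

/// Count how many replies we must take, in arrival order, before we hold
/// K *distinct* shares. Duplicate shares only count once, which is why a
/// duplicated handout can push the unlock from K to K+1 replies.
fn replies_needed(replies: &[u32], k: usize) -> Option<usize> {
    let mut distinct = HashSet::new();
    for (i, share_id) in replies.iter().enumerate() {
        distinct.insert(share_id);
        if distinct.len() == k {
            return Some(i + 1);
        }
    }
    None
}

fn main() {
    // K = 3; the second and third replies carry the same duplicated share.
    assert_eq!(replies_needed(&[1, 2, 2, 3], 3), Some(4)); // K+1 replies
    assert_eq!(replies_needed(&[1, 2, 3], 3), Some(3)); // normal case
}
```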

Now, it must also be emphasized that this is only a problem with the LRTQ. The design for the real trust quorum is to perform reconfigurations via Nexus, including handing out key shares to new sleds. With proper encryption and full replication of shares, we can always recover with an RPW, similar to other ledgers.

andrewjstone commented 1 year ago

It should be noted that we may not even need an RPW for the real trust quorum, as we can just replicate the latest version of the data to the nodes that are supposed to have it. In fact, this is the more likely solution, but the RFD for this protocol has not yet been updated to reflect this behavior.

andrewjstone commented 1 year ago

The same share gets handed out to a different learner - It will just take an extra share to be gathered for the rack to unlock (K+1 instead of K) in the case that both learners with the same share respond to the same LoadRackSecretRequest within the first K(+1) replies. If this happened to N-K sleds, then we could lose all LRTQ redundancy.

We could additionally fix this with some support scripts, if we could detect it. But again, I'd prefer to just build the real TQ.