
[sled agent] Serialize Nexus notification queue #1917

Open · smklein opened this issue 1 year ago

smklein commented 1 year ago

There are many spots in the Sled Agent where we send notifications up to Nexus, such as:

https://github.com/oxidecomputer/omicron/blob/17ab9fdc69022ebfc8f72e85ddc4da1e658b6d1b/sled-agent/src/server.rs#L83-L123

These notification calling sites are spread around Sled Agent. For unrelated updates, this is no problem. However, for related updates, this can unfortunately result in "earlier updates" trampling over "later ones".

Example problems:

Proposal:

andrewjstone commented 1 year ago

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

smklein commented 1 year ago

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved.

I don't think this should matter - even if we're communicating with multiple Nexuses, we won't get a response until that Nexus has committed our message to durable storage (CRDB) which should be shared among different Nexuses. If awaiting that response blocks the subsequent request, we've successfully serialized our notifications.

There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

Yeah, this is true, and it's a bit more complicated: not all notifications from the Sled Agent are equal, so it isn't necessarily right to just ignore everything but the latest update, either.

andrewjstone commented 1 year ago

Ah, ok. I guess I didn't realize you'd be waiting for the calls to get a reply, which I probably should have! Carry on!

davepacheco commented 1 year ago

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.
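To sketch the conditional-write idea in-memory (hypothetical types only; the real check happens as a conditional UPDATE against CRDB, not this code):

```rust
// Illustrative sketch only -- not the actual Nexus types or query.
struct InstanceRuntimeState {
    generation: u64,
    // ... the rest of the runtime state reported by Sled Agent
}

/// Record an incoming notification only if it is newer than what we already
/// have. Duplicates (retransmissions) and out-of-order arrivals are dropped.
fn maybe_record(
    recorded: &mut InstanceRuntimeState,
    incoming: InstanceRuntimeState,
) -> bool {
    if incoming.generation > recorded.generation {
        *recorded = incoming;
        true
    } else {
        false
    }
}
```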

My intuition (and I suspect @andrewjstone's) is that trying to synchronize all notifications is likely to be more brittle than trying to gracefully handle them on the other side in whatever order they occur, even if that just means rejecting some and having them retried, like it sounds like we do today with the zpool/sled/dataset messages. The kinds of problems I'm imagining include: queue growing without bound, starvation of some later notification because of some problem with a previous notification that's causing the whole queue to be stuck, and having notifications pile up because of any transient Nexus/database issue such that the notifications at the beginning of the queue are old and stale (and maybe superseded by later ones).

andrewjstone commented 1 year ago

Those are great points. I wasn't caffeinated enough when I originally objected to think more deeply about my reasons. It was just intuition indeed. The problem of a lost reply is definitely an issue here. For instance, if we time out and send the next queued request instead of waiting forever and building up a queue backlog, the old request could still be floating around in the network and get applied out of order. Generation numbers feel safer to me.

smklein commented 1 year ago

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.

The usage of a generation number makes sense to me when we only care about the latest event - if Sled Agent emits two different runtime states, this technique ensures that the later one wins, even if the notifications arrive out of order.

I think this example works with, for example, instance states, because:

  1. Resource Granularity: There is only a single property we care about (the state of one virtual machine)
  2. Last-Event, not All Events: We are okay with the last-one-wins policy

Sled Agent is currently notifying Nexus of a few different things, including:

I think that if we add a generation number for Sled Agent to use when reporting notifications to Nexus, we'll actually need to add multiple generation numbers. For example...

I think an implementation like this would only work if we had multiple, distinct generation numbers for each resource where we're okay just taking the "latest" info.

For example, the following would be wrong:

This seems doable, but it will require more structure to the Sled Agent -> Nexus notification pathway. I'll poke around at this approach.
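Roughly the shape I'm imagining, as a hypothetical sketch (none of these types exist today; they're just here to illustrate "one generation per independently-reported resource set"):

```rust
// Hypothetical: one generation per independently-reported resource set,
// rather than a single counter covering everything the sled reports.
struct SledReportedGenerations {
    /// Covers the set of all zpools this sled reports.
    zpools: u64,
    /// Covers the sled's own hardware/runtime description.
    sled: u64,
    // ... one field per resource set where "latest wins" is acceptable
}

impl SledReportedGenerations {
    /// Bump only the generation for the resource set that actually changed,
    /// so an unrelated update can't stomp on it.
    fn note_zpool_change(&mut self) {
        self.zpools += 1;
    }
}
```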

rmustacc commented 1 year ago

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply deltas as an optimization. If the above ever fails, you just send the whole current state.

That may not work, but if you assume that nexus is the one that actually knows what the state of the world is and think of this as sending a topology snapshot where it's up to nexus to figure out the diff, that could possibly make sense. In particular, Event A in the zpool case is, I think, more complex assuming a disk pull or fault. It's actually more like:

The reason to think about this is that Nexus can crash at any time while processing this, and folks may have to retry. Conditional PUTs (whether with etags or something else) help here. But also, when a sled comes back up or has batched up a lot of changes because it was offline, it may just want to send the whole thing.

Obviously reporting and processing all of the state rather than just a delta can be more expensive, but given that we should be able to encapsulate a lot of this in a single, serialized topology view of the world, that may be easier than trying to track each of these notifications individually. Also, there's the critical point that if sled agent crashes, it can't assume it has given nexus any information (it literally doesn't know), so it should send the entire state of the world to make sense of things again.

smklein commented 1 year ago

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply deltas as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach that I mentioned in my comment above:

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."
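As a rough sketch of what "Nexus can infer the deltas" from a full snapshot might look like (hypothetical types and function; the real payload would live in the Sled Agent API):

```rust
use std::collections::BTreeMap;

// Hypothetical snapshot payload: every zpool the sled currently sees,
// tagged with a single generation number covering the whole set.
struct ZpoolSnapshot {
    generation: u64,
    /// Map from zpool name to its reported state.
    zpools: BTreeMap<String, String>,
}

/// Zpools that are new or whose state changed since the last recorded
/// snapshot; removed zpools would be the symmetric check over `old`.
fn changed_or_added(old: &ZpoolSnapshot, new: &ZpoolSnapshot) -> Vec<String> {
    let mut out = Vec::new();
    for (name, state) in &new.zpools {
        if old.zpools.get(name) != Some(state) {
            out.push(name.clone());
        }
    }
    out
}
```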

rmustacc commented 1 year ago

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply deltas as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach that I mentioned in my comment above:

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."

So the one difference between what you described and what I tried to describe, I think, was that you had a separate thing for each sub-resource, e.g. a separate thing for all zpools or, say, all NICs, or other resources. I was trying to say that there is one single payload for all of the gimlet, and then deltas from there where that optimization makes sense and can be applied linearizably (with the scheme that @davepacheco and @andrewjstone described).

andrewjstone commented 1 year ago

@smklein Your examples about last-write-wins and the problems there are dead on, and indeed that problem can be solved with a queue. Unfortunately, the queue may lead to other issues, as mentioned here. I think @rmustacc makes a good suggestion overall, and I don't really see much problem with conditional writes as a whole for sled-agent state.

From an implementation point of view, this issue mentions that the notification calling sites are spread across sled agent. We could collate them internally with a channel feeding a single task that then performs the conditional updates, so we don't issue them from many call sites at once and end up with a bunch of optimistic concurrency failures at CockroachDB.
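Something like this, as a hand-wavy sketch (the notification variants and task are made up, just to show the single-consumer channel shape; only the tokio mpsc channel itself is real):

```rust
use tokio::sync::mpsc;

// Hypothetical set of notifications; in reality these would be the existing
// zpool/dataset/sled updates that calling sites currently send directly.
enum NexusNotification {
    ZpoolsChanged,
    SledChanged,
}

/// One task owns the Nexus client and drains the channel in order, so all of
/// the conditional (generation-guarded) updates happen in a single place.
async fn notification_task(mut rx: mpsc::Receiver<NexusNotification>) {
    while let Some(notification) = rx.recv().await {
        match notification {
            // Each arm would PUT the latest snapshot for that resource set,
            // retrying on transient failures.
            NexusNotification::ZpoolsChanged => { /* ... */ }
            NexusNotification::SledChanged => { /* ... */ }
        }
    }
}
```

Calling sites would just hold a clone of the `mpsc::Sender` and send on it, without ever blocking on Nexus themselves.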