
[sled agent] Serialize Nexus notification queue #1917

Open · smklein opened this issue 1 year ago

smklein commented 1 year ago

There are many spots in the Sled Agent where we send notifications up to Nexus, such as:

https://github.com/oxidecomputer/omicron/blob/17ab9fdc69022ebfc8f72e85ddc4da1e658b6d1b/sled-agent/src/server.rs#L83-L123

These notification calling sites are spread around Sled Agent. For unrelated updates, this is no problem. However, for related updates, this can unfortunately result in "earlier updates" trampling over "later ones".

Example problems:

Proposal:

andrewjstone commented 1 year ago

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

smklein commented 1 year ago

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved.

I don't think this should matter - even if we're communicating with multiple Nexuses, we won't get a response until that Nexus has committed our message to durable storage (CRDB) which should be shared among different Nexuses. If awaiting that response blocks the subsequent request, we've successfully serialized our notifications.

There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

Yeah, this is true, and it's a bit more complicated: not all notifications from the Sled Agent are equal, so it isn't necessarily right to just ignore everything but the latest update, either.

andrewjstone commented 1 year ago

Ah, ok. I guess I didn't realize you'd be waiting for the calls to get a reply, which I probably should have! Carry on!

davepacheco commented 1 year ago

I think one problem with this is that different requests can go to different Nexuses, and so the ordering may not be preserved. There's also the problem of queue sizing although that's likely minor here. We could instead use an atomic counter for updates and have nexus order based on those, by writing that counter to a DB. On sled-agent crashes we'd have to get the latest value of the counter back so we could continue to increment though.

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.
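To sketch the conditional-write idea in-memory (hypothetical types only; the real check happens as a conditional UPDATE against CRDB, not this code):

```rust
// Illustrative sketch only -- not the actual Nexus types or query.
struct InstanceRuntimeState {
    generation: u64,
    // ... the rest of the runtime state reported by Sled Agent
}

/// Record an incoming notification only if it is newer than what we already
/// have. Duplicates (retransmissions) and out-of-order arrivals are dropped.
fn maybe_record(
    recorded: &mut InstanceRuntimeState,
    incoming: InstanceRuntimeState,
) -> bool {
    if incoming.generation > recorded.generation {
        *recorded = incoming;
        true
    } else {
        false
    }
}
```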

My intuition (and I suspect @andrewjstone's) is that trying to synchronize all notifications is likely to be more brittle than trying to gracefully handle them on the other side in whatever order they occur, even if that just means rejecting some and having them retried, like it sounds like we do today with the zpool/sled/dataset messages. The kinds of problems I'm imagining include: queue growing without bound, starvation of some later notification because of some problem with a previous notification that's causing the whole queue to be stuck, and having notifications pile up because of any transient Nexus/database issue such that the notifications at the beginning of the queue are old and stale (and maybe superseded by later ones).

andrewjstone commented 1 year ago

Those are great points. I wasn't caffeinated enough when I originally objected to think more deeply about my reasons. It was just intuition indeed. The problem of a lost reply is definitely an issue here. For instance, if we time out and send the next queued request instead of waiting forever and building up a queue backlog, the old request could still be floating around in the network and get applied out of order. Generation numbers feel safer to me.

smklein commented 1 year ago

If I'm understanding right, this is the pattern Nexus already uses elsewhere. For example: Nexus receives Instance runtime state notifications from the Sled Agent and writes them to the database. The runtime state has a generation number. The Sled Agent owns that state and so also owns the generation number. When Nexus writes the state to the database, it just does so conditional on the generation number in the database being older than what it's writing. This approach should correctly handle both duplicates (retransmissions) and out-of-order arrivals. This case is relatively simple though because we don't care that much about the contents of the runtime state in this code path -- we're just recording it. If we did need to do something based on what we find, it might be trickier to handle notifications out of order.

The usage of a generation number makes sense to me when we only care about the latest event - if Sled Agent emits two different runtime states, this technique ensures that the later one wins, even if the notifications arrive out of order.

I think this example works with, for example, instance states, because:

  1. Resource Granularity: There is only a single property we care about (the state of one virtual machine)
  2. Last-Event, not All Events: We are okay with the last-one-wins policy

Sled Agent is currently notifying Nexus of a few different things, including:

I think that if we add a generation number for Sled Agent to use when reporting notifications to Nexus, we'll actually need to add multiple generation numbers. For example...

I think an implementation like this would only work if we had multiple, distinct generation numbers for each resource where we're okay just taking the "latest" info.

For example, the following would be wrong:

This seems doable, but it will require more structure to the Sled Agent -> Nexus notification pathway. I'll poke around at this approach.
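Roughly the shape I'm imagining, as a hypothetical sketch (none of these types exist today; they're just here to illustrate "one generation per independently-reported resource set"):

```rust
// Hypothetical: one generation per independently-reported resource set,
// rather than a single counter covering everything the sled reports.
struct SledReportedGenerations {
    /// Covers the set of all zpools this sled reports.
    zpools: u64,
    /// Covers the sled's own hardware/runtime description.
    sled: u64,
    // ... one field per resource set where "latest wins" is acceptable
}

impl SledReportedGenerations {
    /// Bump only the generation for the resource set that actually changed,
    /// so an unrelated update can't stomp on it.
    fn note_zpool_change(&mut self) {
        self.zpools += 1;
    }
}
```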

rmustacc commented 1 year ago

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply deltas as an optimization. If the above ever fails, you just send the whole current state.

That may not work, but if you assume that nexus is the one that actually knows what the state of the world is and think of this as sending a topology snapshot where it's up to nexus to figure out the diff, that could possibly make sense. In particular, Event A in the zpool case is, I think, more complex assuming a disk pull or fault. It's actually more like:

The reason to think about this is that Nexus can crash at any time while processing this, and folks may have to retry. Conditional PUTs (whether with etags or something else) help here. But also, when a sled comes back up or has batched up a lot of changes because it was offline, it may just want to send the whole thing.

Obviously reporting and processing all of the state rather than just a delta can be more expensive, but given that we should be able to encapsulate a lot of this in a single, serialized topology view of the world, that may be easier than trying to track each of these notifications individually. Also, there's the critical point that if sled agent crashes, it can't assume it has given nexus any information (it literally doesn't know), so it should send the entire state of the world to make sense of things again.

smklein commented 1 year ago

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply deltas as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach that I mentioned in my comment above:

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."
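As a rough sketch of what "Nexus can infer the deltas" from a full snapshot might look like (hypothetical types and function; the real payload would live in the Sled Agent API):

```rust
use std::collections::BTreeMap;

// Hypothetical snapshot payload: every zpool the sled currently sees,
// tagged with a single generation number covering the whole set.
struct ZpoolSnapshot {
    generation: u64,
    /// Map from zpool name to its reported state.
    zpools: BTreeMap<String, String>,
}

/// Zpools that are new or whose state changed since the last recorded
/// snapshot; removed zpools would be the symmetric check over `old`.
fn changed_or_added(old: &ZpoolSnapshot, new: &ZpoolSnapshot) -> Vec<String> {
    let mut out = Vec::new();
    for (name, state) in &new.zpools {
        if old.zpools.get(name) != Some(state) {
            out.push(name.clone());
        }
    }
    out
}
```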

rmustacc commented 1 year ago

A few thoughts on this. While I don't know the overall structure of the nexus data, here are a few thoughts that come to mind:

  • Data can't be lost if sled agent is doing conditional PUTs with what it thinks the prior generation number is and, if that fails, sending a whole update and computing what's changed.
  • In general, you're going to want to send the whole state to initialize things and then only apply deltas as an optimization. If the above ever fails, you just send the whole current state.

Great point, and for a first cut, we should probably just punt on the optimization. I think sending snapshots of "portions" of the device tree to Nexus may be a good incremental step from the current "single-zpool-at-a-time" API - if Sled Agent just sends a view of "here are all the zpools I see, with their corresponding states", Nexus can infer the deltas.

This seems to favor the approach that I mentioned in my comment above:

"Alternatively, we could have a generation number to track the set of "all zpools" on a sled, as a single unit."

So the one difference between what you described and what I tried to describe, I think, was that you had a separate thing for each sub-resource, e.g. a separate thing for all zpools or, say, all NICs, or other resources. I was trying to say that there is one single payload for all of the gimlet, and then deltas from there where that optimization makes sense and can be applied linearizably (with the scheme that @davepacheco and @andrewjstone described).

andrewjstone commented 1 year ago

@smklein Your examples about last-write-wins and the problems there are dead on, and indeed that problem can be solved with a queue. Unfortunately, the queue may lead to other issues, as mentioned here. I think @rmustacc makes a good suggestion overall, and I don't really see much problem with conditional writes as a whole for sled-agent state.

From an implementation point of view, this issue mentions that the notification calling sites are spread across sled agent. We could collate them internally with a channel feeding a single task that then performs the conditional updates, so we don't issue them from many call sites at once and end up with a bunch of optimistic concurrency failures at CockroachDB.
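Something like this, as a hand-wavy sketch (the notification variants and task are made up, just to show the single-consumer channel shape; only the tokio mpsc channel itself is real):

```rust
use tokio::sync::mpsc;

// Hypothetical set of notifications; in reality these would be the existing
// zpool/dataset/sled updates that calling sites currently send directly.
enum NexusNotification {
    ZpoolsChanged,
    SledChanged,
}

/// One task owns the Nexus client and drains the channel in order, so all of
/// the conditional (generation-guarded) updates happen in a single place.
async fn notification_task(mut rx: mpsc::Receiver<NexusNotification>) {
    while let Some(notification) = rx.recv().await {
        match notification {
            // Each arm would PUT the latest snapshot for that resource set,
            // retrying on transient failures.
            NexusNotification::ZpoolsChanged => { /* ... */ }
            NexusNotification::SledChanged => { /* ... */ }
        }
    }
}
```

Calling sites would just hold a clone of the `mpsc::Sender` and send on it, without ever blocking on Nexus themselves.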