If I'm understanding this issue correctly, we're saying that the events should be paginated in "node ID" order, not "saga ID" order, flagging this line as incorrect:
I also noticed that this function calls into https://github.com/oxidecomputer/omicron/blob/cb3a9bf2d49df9b048f590d2340b0a8a07819f01/nexus/db-queries/src/db/saga_recovery.rs#L292-L293C23 to actually get the batch.
This function also seems incorrect, as it is using the pagination helper but is ordering by `saga_id`, not `node_id`: https://github.com/oxidecomputer/omicron/blob/cb3a9bf2d49df9b048f590d2340b0a8a07819f01/nexus/db-queries/src/db/datastore/saga.rs#L131-L138
Yes, I think that's the line that's wrong, and agreed that `saga_node_event_list_by_id()` is also sorting on the wrong column. Since the pagination key is supposed to be unique, and there can be multiple events for the same `node_id`, I think we'll probably want the pagination key to be the tuple of `(node_id, event_type)`.
I'd suggest that we convert this code over to the more modern `Paginator`. It should be pretty easy to use and it looks like it supports paginating by two columns like this.
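For what it's worth, here's a minimal, self-contained sketch of what keyset pagination with a composite `(node_id, event_type)` marker looks like. This is not omicron's `Paginator` or the real Diesel query -- the "database" is just a pre-sorted `Vec` and every name is invented for illustration:

```rust
// A self-contained sketch of keyset pagination keyed on (node_id, event_type).
// NOT omicron's Paginator or the real Diesel query; names are invented.
#[derive(Clone, Debug, PartialEq)]
struct SagaNodeEventRow {
    node_id: u32,
    event_type: String,
}

/// Return up to `limit` rows whose (node_id, event_type) is strictly greater
/// than `marker`, in (node_id, event_type) order -- the moral equivalent of
/// `WHERE (node_id, event_type) > ($1, $2) ORDER BY node_id, event_type LIMIT $3`.
fn fetch_page<'a>(
    rows: &'a [SagaNodeEventRow], // pre-sorted by (node_id, event_type)
    marker: Option<&(u32, String)>,
    limit: usize,
) -> &'a [SagaNodeEventRow] {
    let start = match marker {
        None => 0,
        Some((node_id, event_type)) => rows
            .iter()
            .position(|r| (r.node_id, r.event_type.as_str()) > (*node_id, event_type.as_str()))
            .unwrap_or(rows.len()),
    };
    &rows[start..(start + limit).min(rows.len())]
}

fn main() {
    // Two events per node, all for the same saga, like a real saga log.
    let rows = vec![
        SagaNodeEventRow { node_id: 0, event_type: "started".into() },
        SagaNodeEventRow { node_id: 0, event_type: "succeeded".into() },
        SagaNodeEventRow { node_id: 1, event_type: "started".into() },
        SagaNodeEventRow { node_id: 1, event_type: "succeeded".into() },
    ];

    let mut all = Vec::new();
    let mut marker: Option<(u32, String)> = None;
    loop {
        let batch = fetch_page(&rows, marker.as_ref(), 2);
        if let Some(last) = batch.last() {
            // The next-page marker is the full composite key of the last row,
            // not the saga_id (which is the same for every row).
            marker = Some((last.node_id, last.event_type.clone()));
        }
        all.extend_from_slice(batch);
        if batch.len() < 2 {
            break;
        }
    }
    assert_eq!(all, rows); // every event is seen exactly once
}
```

The key property is that the marker is the full composite key of the last row seen, so the next page's strict-greater-than comparison always makes progress past rows already fetched.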
Following the story of #5947: I went to dig into this error message:
Here, Steno is reporting an internal error recovering the saga because its log seems to reflect an impossible state. Yikes -- this sounds scary. What's the state of this saga?
I trimmed `saga_dag` out of that output because it's enormous and turned out not to be relevant. But I did wind up digging into it a bit. Here's the log of saga events, ignoring the data produced by each node:

This is complicated enough that I wanted to visualize it. But first I wanted to better understand the specific problem that's being reported. I started working through the recovery code, which involves the Nexus side calling into Steno to resume the saga. That lands in `cmd_saga_resume` inside Steno. There's a log message there -- and it's in the Nexus log! Great.

So what's the error saying? That ends up coming from here, where we create an executor to resume running the saga. It's saying: node 0 has an ancestor (i.e., it depends on another node), and that node hasn't started -- yet node 0 supposedly finished. This was surprising to me on a few levels: obviously something shouldn't be able to start if its dependency didn't finish, but also, isn't node 0 the start node?
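To make the complaint concrete, here's a rough, invented model of the invariant being checked (this is not steno's actual code; the names and types are made up): a node may only be recorded as finished if every node it depends on is finished too.

```rust
// Invented model of the resume-time consistency check; not steno's code.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeState {
    NeverStarted,
    Succeeded,
}

fn check_node(
    node: u32,
    ancestors: &HashMap<u32, Vec<u32>>, // node -> the nodes it depends on
    states: &HashMap<u32, NodeState>,
) -> Result<(), String> {
    if states.get(&node) == Some(&NodeState::Succeeded) {
        for dep in ancestors.get(&node).into_iter().flatten() {
            if states.get(dep) != Some(&NodeState::Succeeded) {
                return Err(format!(
                    "node {} is recorded as done, but ancestor {} is {:?}",
                    node, dep, states.get(dep)
                ));
            }
        }
    }
    Ok(())
}

fn main() {
    // The reported shape: node 0 is done, but one of its ancestors never started.
    let ancestors = HashMap::from([(0u32, vec![1u32])]);
    let states = HashMap::from([
        (0u32, NodeState::Succeeded),
        (1u32, NodeState::NeverStarted),
    ]);
    assert!(check_node(0, &ancestors, &states).is_err());
}
```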
I confirmed from a similar query above that includes the "data" field that node 0 really did start and finish -- with data and everything:
But I still didn't get why node 0 had any incoming edges.
Aside: trying to visualize this saga's DAG
I thought it'd be useful to visualize the saga's DAG and I know Steno has a way to print a `dot` graph. Fortunately, the "saga resume" log message includes the serialized `SagaDag`. I saved this to a file. Steno doesn't have a tool for reading this in, but I made one pretty easily, calling it `examples/saga-dump.rs` in a `steno` clone:

I ran it as `cargo run --example=saga-dump -- dag.json > dag.dot`, then converted that to a PNG with `dot -o dag.png -Tpng dag.dot`. This took surprisingly long -- like 10+ seconds to run `dot`! I thought it was broken, but it's just that this is a huge DAG. I did manage to open it in Preview, but this wasn't that helpful -- it's a huge, pretty linear graph. Still, the dot file was pretty useful as a much clearer way to view the nodes and their dependencies.

Resetting a bit: I'm trying to figure out: what points to node 0? And this is easily answered by the dot file, which ends with:
Node 182 points to node 0? And then I noticed this in the DAG:
Okay :facepalm: I just misremembered how this worked. Steno first creates all the nodes that the consumer gave it, then creates the start node (182 in this case), and that becomes a dependency of the first user-created node (which is thus node 0). So fine -- we understand now why node 0 has a dependency. But then: what's the state of node 182? Did it finish or not? Steno reports its status is `NeverStarted` -- how did it get that?

At this point I was worried that saga recovery was completely broken, so I did a quick check using Steno's example. First, I did a full run and saved the log:
Then I manually chopped off the last few entries of the log (so that it would appear like node 11 was in-progress and node 13 hadn't run) and recovered it:
Good -- that's correct. In particular, we can see that the loaded state reflects that the second-to-last node hasn't finished, yet it finishes by the end. So this isn't totally broken.
I also noticed about this example that:
That led me to ask: do we have log entries for node 182 in our saga's log? We do! (See above.) Though it's interesting that they appear relatively late in the log. Now, while looking at the code I noticed that Steno sorts the log semantically (not by timestamp) to make it easier to iterate them in an order that will make sense. Is that somehow not accounting for the start node? After some time with that code, it didn't seem like that should matter: all it depends on is having seen the "start" for a node before it's "finished", and that should be the case here even if the start node winds up being processed last.
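Here's a tiny, invented model of that reasoning (not steno's real loader): replay only needs each node's own "started" to precede its "succeeded"; the relative order of different nodes doesn't matter. But if a node's entries are missing entirely, it ends up looking like it never started.

```rust
// Invented model of log replay; not steno's actual loader.
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Started,
    Succeeded,
}

fn load(events: &[(u32, &str)]) -> Result<HashMap<u32, State>, String> {
    let mut states = HashMap::new();
    for &(node, kind) in events {
        match kind {
            "started" => {
                states.insert(node, State::Started);
            }
            "succeeded" => {
                let prev = states.get(&node).copied();
                if prev != Some(State::Started) {
                    return Err(format!("node {} finished but was {:?}", node, prev));
                }
                states.insert(node, State::Succeeded);
            }
            other => return Err(format!("unknown event {:?}", other)),
        }
    }
    Ok(states)
}

fn main() {
    // The start node's entries come last, but each node's own events are in
    // order, so loading succeeds.
    let log = [(0, "started"), (0, "succeeded"), (182, "started"), (182, "succeeded")];
    assert!(load(&log).is_ok());

    // Drop node 182's entries entirely (which is what the pagination bug does)
    // and the loaded state simply has no record of it ever starting.
    let truncated = [(0, "started"), (0, "succeeded")];
    assert!(load(&truncated).unwrap().get(&182).is_none());
}
```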
I think it was about this time I wondered if somehow we hadn't loaded the log entries for node 182 from the database. I looked at the code that fetched the `saga_node_event` rows... and spotted the bug: when it fetches the log, the pagination code is hand-rolled (it long predates `Paginator`) -- and it incorrectly paginates using the `saga_id` as the marker. But the `saga_id` for all of these entries is the same! So the behavior you'd expect from this code is:

- fetch the first page of results (marker `None`) with `saga_id` = our saga
- fetch the next page of results with `saga_id` > our saga. (We're still correctly filtering on `saga_id` = our saga.) That's wrong -- that'll find no more rows.

We want to be paginating using the node id (and probably the event type, to ensure uniqueness) as the marker instead.
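As a toy demonstration of that failure mode (this is a stand-in for the real query code, not the actual implementation): with the saga_id as the marker, the second page's strict-greater-than predicate can never match, so only the first 100 of 148 events ever come back.

```rust
// Toy stand-in for the buggy loop; every row shares the same saga_id, and the
// "next page" predicate asks for saga_id strictly greater than the marker.
fn main() {
    let saga_id = 42u32;
    // 148 events (saga_id, node_id), all for the same saga, as in the real log.
    let rows: Vec<(u32, u32)> = (0..148).map(|node| (saga_id, node)).collect();

    let mut fetched = 0;
    let mut marker: Option<u32> = None; // the marker is a saga_id (the bug)
    loop {
        let batch: Vec<_> = rows
            .iter()
            // still correctly filtering on saga_id = our saga ...
            .filter(|row| row.0 == saga_id)
            // ... but the "pagination" asks for saga_id > marker
            .filter(|row| marker.map_or(true, |m| row.0 > m))
            .take(100)
            .collect();
        fetched += batch.len();
        if batch.len() < 100 {
            break;
        }
        marker = Some(batch.last().unwrap().0); // always equal to our saga_id
    }
    // Only the first page comes back; the other 48 events are silently dropped.
    assert_eq!(fetched, 100);
}
```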
Does this actually explain the behavior? First of all, do we even have more than one page of rows? We do: the page size is 100 and there are 148 rows in the output above. For this explanation to work, node 182's rows would have to not be in the first page, which is a little surprising because at least PostgreSQL tends to return rows in insertion order (though that's not guaranteed when the `ORDER BY` doesn't fully specify it). But we can see from the output above that even though node 182's timestamps are the earliest, it appears last in the natural sort.

Still, I'd feel better if I could see the problem in action using this same database. So I wrote a little `omdb` command that basically does the following: it loads this saga's node events the same way the recovery path does (using the same hand-rolled pagination, starting with marker `None`) and prints them out. This basically lets me do the same thing the recovery path would do via omdb, where I can easily print stuff out. I could also fix the implementation and run the same thing and verify that it works now. (I haven't done that part yet.)
Here's the source for the omdb command:
and here's the output:
This is pretty close to a smoking gun. This shows that:
I also reproduced from SQL by first listing everything in timestamp order to show that it looks like what we'd expect:
Notice that node 182 shows up first, as we'd expect, and if we limit to the first 100, we get it. If we order by `saga_id` instead, which is what it looks like the code would do:
It happens to put node 182 last, and it's not in the first 100:
So I believe that explains everything and the fix should be fairly straightforward.
How long has this been broken? The pagination code was rewritten when we introduced Diesel back in late 2021 and it appears this issue was introduced then, though it probably could not be triggered until later, when we first had sagas with more than about 50 nodes (enough nodes to push the node events into the second page). However, the conditions required to hit it are somewhat specific and unpredictable: you need a saga large enough, plus a Nexus crash, plus whatever conditions cause CockroachDB to put the start node at the end of the result set.
In principle, it's conceivable that we could have a much worse failure mode: suppose we started executing a saga, then Nexus restarted, and when it came up, it got an incomplete but valid log. Steno might resume execution but actually wind up rerunning some nodes that had already run before. It's not clear how arbitrarily bad things could get in this case. However, I think this would be impossible because:

- the `saga_node_event` table has a unique key on `(saga_id, node_id, event_type)`, which means that if Steno went to execute the same action twice, it should find a conflict here and fail to proceed
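As a toy model of that argument (invented types, not the real schema or datastore code): with a composite unique key, re-recording the same (saga_id, node_id, event_type) is a conflict rather than a silent success, so an executor that mistakenly re-ran a node would fail loudly.

```rust
// Invented model of the composite unique key; not the real schema code.
use std::collections::HashSet;

#[derive(Debug)]
struct ConflictError;

#[derive(Default)]
struct SagaLog {
    // simulates a UNIQUE constraint on (saga_id, node_id, event_type)
    keys: HashSet<(u64, u32, &'static str)>,
}

impl SagaLog {
    fn record(
        &mut self,
        saga_id: u64,
        node_id: u32,
        event_type: &'static str,
    ) -> Result<(), ConflictError> {
        if self.keys.insert((saga_id, node_id, event_type)) {
            Ok(())
        } else {
            Err(ConflictError)
        }
    }
}

fn main() {
    let mut log = SagaLog::default();
    log.record(1, 0, "started").unwrap();
    log.record(1, 0, "succeeded").unwrap();
    // A recovered executor that re-ran node 0 would try to write these events
    // again and hit the conflict instead of proceeding.
    assert!(log.record(1, 0, "started").is_err());
}
```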