Saw this on a March 28th iteration of dogfood
Adding to what @smklein said -- we have two cores at /staff/dock/rack2/mupdate-20240329/cores/sled-08.
With mdb <core>, then $c:
core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.10458.1711663318:
Loading modules: [ libumem.so.1 libnvpair.so.1 libc.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4fd55()
_ZN95_$LT$nexus_db_queries..db..sec_store..CockroachDbSecStore$u20$as$u20$steno..store..SecStore$GT$12record_event28_$u7b$$u7b$closure$u7d$$u7d$17hc33ebb1cf636f348E+0xcec()
And core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.5125.1711667683:
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4f48a()
0x399e8a2()
_ZN98_$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$GT$9from_iter17h1c0a0b91677a258cE+0xc4()
_ZN159_$LT$omicron_nexus..app..background..sync_switch_configuration..SwitchPortSettingsManager$u20$as$u20$omicron_nexus..app..background..common..BackgroundTask$GT$8activate28_$u7b$$u7b$closure$u7d$$u7d$17h003d11b741f553f2E+0x26f91()
The first one (10458) seems to be https://github.com/oxidecomputer/omicron/blob/cf185c558347a894056c154087442914c4820905/nexus/db-queries/src/db/sec_store.rs#L65.
The second one (5125) is from somewhere in https://github.com/oxidecomputer/omicron/blob/17510a64780b86733b39300cfea9946f9623f0dd/nexus/src/app/background/sync_switch_configuration.rs#L275.
The second one (pid 5125) is unrelated to this issue -- I've filed https://github.com/oxidecomputer/omicron/issues/5361 for that.
I believe there is still an unwrap here, which can cause Nexus to panic if CRDB is unavailable:
This was the root cause of https://github.com/oxidecomputer/omicron/issues/6090
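For illustration only, the shape of the problem is roughly the following; the types here are hypothetical stand-ins, not the actual sec_store.rs code, but the failure mode is the one described above: a saga-event write whose error is unwrapped, so a transient CRDB outage aborts the whole Nexus process.

```rust
// Hypothetical stand-ins for the datastore and its error type; the real call
// goes through CockroachDbSecStore::record_event and the omicron datastore.
struct Datastore;

#[derive(Debug)]
struct DbError(String);

impl Datastore {
    // Stand-in for the saga-event insert, which fails if CRDB is unreachable.
    fn saga_create_event(&self) -> Result<(), DbError> {
        Err(DbError("connection refused".to_string()))
    }
}

fn record_event(datastore: &Datastore) {
    // Unwrapping here turns any CRDB unavailability into a process abort --
    // the kind of panic seen in the pid 10458 core above.
    datastore.saga_create_event().unwrap();
}
```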
Ah yeah you're right, shouldn't have closed this. Sorry!
No worries, it was easy to miss. Do you wanna take fixing this one, or should I?
I'll pick it up, thanks!
@sunshowers see also https://github.com/oxidecomputer/omicron/issues/6090#issuecomment-2229509411
@davepacheco thanks -- the first one can be done easily enough I hope, but does the second one need optimistic concurrency/a generation number? If so, then we should just implement that.
I think it already has that, just using a generation number that's made up? The "adopt_generation" was intended to be bumped whenever a takeover happens, but we haven't implemented that.
I'm not sure it's worth implementing a case we can't have in production and so can't test. What would we do if the OCC update fails for some reason?
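To make the generation-number check concrete, here is a rough sketch of the compare-and-swap style precondition being discussed. The field names (current_sec, adopt_generation) come from the saga state, but the types and the in-memory update are simplified stand-ins for what is really a conditional database update.

```rust
// Simplified sketch of an OCC-guarded saga state update. Only the SEC that
// currently owns the saga, at the adopt_generation it believes is current,
// may move the saga to a new state; a takeover would bump adopt_generation
// and make this check fail for the old owner.
struct SagaRow {
    current_sec: Option<String>,
    adopt_generation: u64,
    state: String,
}

fn update_saga_state(
    row: &mut SagaRow,
    expected_sec: &str,
    expected_generation: u64,
    new_state: &str,
) -> Result<(), String> {
    if row.current_sec.as_deref() != Some(expected_sec)
        || row.adopt_generation != expected_generation
    {
        return Err(format!(
            "preconditions not met: expected current_sec = {}, adopt_generation = {}, \
             but found current_sec = {:?}, adopt_generation = {}",
            expected_sec, expected_generation, row.current_sec, row.adopt_generation
        ));
    }
    row.state = new_state.to_string();
    Ok(())
}
```

This is the kind of precondition failure that shows up in the log quoted below.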
Grepped through omicron for "2416" and didn't see any other results, so I think this is done for real now.
Related: oxidecomputer/steno#302. I hit this because in my PR I had broken things so that this retry loop kept hitting a permanent error:
00:19:42.794Z ERRO c39c0c31-e34e-4b77-9616-2a0bf956f9b6 (ServerContext): client error while updating saga state (likely requires operator intervention), retrying anyway
call_count = 12
error = Invalid Request: failed to update saga SagaId(9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2) with state Done: preconditions not met: expected current_sec = c39c0c31-e34e-4b77-9616-2a0bf956f9b6, adopt_generation = Generation(Generation(1)), but found current_sec = Some(c39c0c31-e34e-4b77-9616-2a0bf956f9b6), adopt_generation = Generation(Generation(2)), state = SagaCachedState(Running)
file = nexus/db-queries/src/db/sec_store.rs:147
new_state = done
saga_id = 9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2
total_duration = 612.261975772s
During that time, I couldn't list sagas. My guess is that other sagas couldn't complete, either. I think Nexus is doing nothing wrong and that Steno ought to behave better here but I figured I'd mention it here so folks were aware.
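One way Steno could behave better here, sketched only as a possibility and not as what steno#302 actually settles on, is to classify store errors before retrying, so that a permanent failure like the precondition mismatch above gets surfaced rather than retried forever:

```rust
// Hypothetical error classification for the saga-state retry loop; these
// types are illustrative, not steno's API.
enum StoreError {
    // e.g. CRDB temporarily unreachable: retrying can eventually succeed.
    Transient(String),
    // e.g. the "preconditions not met" failure above: no amount of retrying
    // will help, so this should be surfaced to the caller/operator instead.
    Permanent(String),
}

fn should_retry(err: &StoreError) -> bool {
    matches!(err, StoreError::Transient(_))
}
```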
Nexus SEC operations currently panic if they fail: https://github.com/oxidecomputer/omicron/blob/cb3d713a6ec7d1515bce61c6073dd460ac6b9f87/nexus/src/db/sec_store.rs#L63-L65
They should use a retry loop instead. I think this should be pretty straightforward as long as we're willing to block saga progress on CockroachDB coming back (which it seems like we should).
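A minimal sketch of the retry-loop approach, assuming the datastore write is passed in as a closure; this is a hypothetical helper, and the real change would live in CockroachDbSecStore and would more likely reuse omicron's existing backoff utilities:

```rust
use std::time::Duration;

// Keep attempting the saga-state write until it succeeds, blocking saga
// progress while CockroachDB is unavailable instead of panicking Nexus.
async fn retry_until_ok<F, Fut, E>(mut attempt: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), E>>,
    E: std::fmt::Display,
{
    let mut delay = Duration::from_millis(100);
    loop {
        match attempt().await {
            Ok(()) => return,
            Err(e) => {
                eprintln!("failed to record saga state (will retry): {}", e);
                tokio::time::sleep(delay).await;
                // Exponential backoff with a cap, so a long outage doesn't
                // hammer CRDB the moment it comes back.
                delay = (delay * 2).min(Duration::from_secs(15));
            }
        }
    }
}
```

Whether to retry indefinitely or to treat some errors as permanent is exactly the tension raised above around oxidecomputer/steno#302.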