oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

Nexus SEC operations panic when CockroachDB is unavailable #2416

Closed davepacheco closed 3 months ago

davepacheco commented 1 year ago

Nexus SEC operations currently panic if they fail: https://github.com/oxidecomputer/omicron/blob/cb3d713a6ec7d1515bce61c6073dd460ac6b9f87/nexus/src/db/sec_store.rs#L63-L65

They should use a retry loop instead. I think this should be pretty straightforward as long as we're willing to block saga progress on CockroachDB coming back (which it seems like we should).

smklein commented 7 months ago

Saw this on a March 28th iteration of dogfood

sunshowers commented 7 months ago

Adding to what @smklein said -- we have two cores at /staff/dock/rack2/mupdate-20240329/cores/sled-08.

With mdb <core>, then $c:

core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.10458.1711663318:

Loading modules: [ libumem.so.1 libnvpair.so.1 libc.so.1 ld.so.1 ]
> $c
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4fd55()
_ZN95_$LT$nexus_db_queries..db..sec_store..CockroachDbSecStore$u20$as$u20$steno..store..SecStore$GT$12record_event28_$u7b$$u7b$closure$u7d$$u7d$17hc33ebb1cf636f348E+0xcec()
remaining frames ``` _ZN5steno3sec3Sec12executor_log28_$u7b$$u7b$closure$u7d$$u7d$17hc49d437d71bb12e7E+0x53() _ZN117_$LT$futures_util..stream..futures_unordered..FuturesUnordered$LT$Fut$GT$$u20$as$u20$futures_core..stream..Stream$GT$9poll_next17hfd98f22f6167186cE+0x1ac() _ZN88_$LT$tokio..future..poll_fn..PollFn$LT$F$GT$$u20$as$u20$core..future..future..Future$GT$4poll17h87f05f880b8d9347E+0x28c() _ZN5steno3sec3sec28_$u7b$$u7b$closure$u7d$$u7d$17he80dac5e7ee8228cE.llvm.5074909589756572123+0x3b9() _ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17hb16ad800f6b83c3eE+0x49() _ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h4c987821e435a118E+0x48() _ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h5720ac884dcfa807E+0x146() _ZN5tokio7runtime7context6scoped15Scoped$LT$T$GT$3set17h146604d8d66fce1bE+0xabe() _ZN5tokio7runtime7context7runtime13enter_runtime17had12f1b61675d596E+0x193() _ZN5tokio7runtime9scheduler12multi_thread6worker3run17hfc17fe6c20967720E+0x4b() _ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h9cfb48bd55732e89E+0x73() _ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h5f94343b6f772f15E+0x97() _ZN5tokio7runtime8blocking4pool5Inner3run17h35df801e4bd32ce5E+0xe4() _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h53b7bfcbc6b4686bE+0x3f() _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h9026fcd303042064E+0x75() _ZN3std3sys4unix6thread6Thread3new12thread_start17h1783cbcbbf061711E+0x29() libc.so.1`_thrp_setup+0x77(fffff5ffeddcda40) libc.so.1`_lwp_start() ```

And core.oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7.nexus.5125.1711667683:

libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x22(6)
libc.so.1`abort+0x58()
0x5a207b9()
0x5a207a9()
rust_panic+0xd()
_ZN3std9panicking20rust_panic_with_hook17h7d19ef586749da2fE+0x2ab()
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h95e3e1ca24551a68E+0xa4()
0x5a05ef9()
0x5a08802()
0x5a4f6c5()
0x5a4f48a()
0x399e8a2()
_ZN98_$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$GT$9from_iter17h1c0a0b91677a258cE+0xc4()
_ZN159_$LT$omicron_nexus..app..background..sync_switch_configuration..SwitchPortSettingsManager$u20$as$u20$omicron_nexus..app..background..common..BackgroundTask$GT$8activate28_$u7b$$u7b$closure$u7d$$u7d$17h003d11b741f553f2E+0x26f91()
remaining frames ``` _ZN13omicron_nexus3app10background6common8TaskExec8activate28_$u7b$$u7b$closure$u7d$$u7d$17h60138c0e224896edE+0x233() _ZN13omicron_nexus3app10background6common8TaskExec3run28_$u7b$$u7b$closure$u7d$$u7d$17hb86814c79d6f1a8fE.llvm.10434374178345103703+0x467() _ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17hdf5fcd0b9767bf4aE+0x83() _ZN5tokio7runtime9scheduler12multi_thread6worker7Context8run_task17h5720ac884dcfa807E+0x190() _ZN5tokio7runtime7context6scoped15Scoped$LT$T$GT$3set17h146604d8d66fce1bE+0xabe() _ZN5tokio7runtime7context7runtime13enter_runtime17had12f1b61675d596E+0x193() _ZN5tokio7runtime9scheduler12multi_thread6worker3run17hfc17fe6c20967720E+0x4b() _ZN5tokio7runtime4task4core17Core$LT$T$C$S$GT$4poll17h9cfb48bd55732e89E+0x73() _ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17h5f94343b6f772f15E+0x97() _ZN5tokio7runtime8blocking4pool5Inner3run17h35df801e4bd32ce5E+0xe4() _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h53b7bfcbc6b4686bE+0x3f() _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h9026fcd303042064E+0x75() _ZN3std3sys4unix6thread6Thread3new12thread_start17h1783cbcbbf061711E+0x29() libc.so.1`_thrp_setup+0x77(fffff5ffedddda40) libc.so.1`_lwp_start() ```
sunshowers commented 7 months ago

The first one 10458 seems to be https://github.com/oxidecomputer/omicron/blob/cf185c558347a894056c154087442914c4820905/nexus/db-queries/src/db/sec_store.rs#L65.

The second one 5125 is from somewhere in https://github.com/oxidecomputer/omicron/blob/17510a64780b86733b39300cfea9946f9623f0dd/nexus/src/app/background/sync_switch_configuration.rs#L275.

sunshowers commented 7 months ago

The second pid 5125 one is unrelated to this issue -- have filed https://github.com/oxidecomputer/omicron/issues/5361 for that.

smklein commented 4 months ago

I believe there still exists an unwrap here, which can cause nexus to panic if CRDB is unavailable:

https://github.com/oxidecomputer/omicron/blob/031b5ecb6bebae76100d1f9240d258c0dcbbc91e/nexus/db-queries/src/db/sec_store.rs#L127-L142

This was the root cause of https://github.com/oxidecomputer/omicron/issues/6090

sunshowers commented 4 months ago

Ah yeah you're right, shouldn't have closed this. Sorry!

smklein commented 4 months ago

Ah yeah you're right, shouldn't have closed this. Sorry!

No worries, it was easy to miss. Do you wanna take fixing this one, or should I?

sunshowers commented 4 months ago

I'll pick it up, thanks!

davepacheco commented 4 months ago

@sunshowers see also https://github.com/oxidecomputer/omicron/issues/6090#issuecomment-2229509411

sunshowers commented 4 months ago

@davepacheco thanks -- the first one can be done easily enough I hope, but does the second one need optimistic concurrency/a generation number? If so, then we should just implement that.

davepacheco commented 4 months ago

I think it already has that, just using a generation number that's made up? The "adopt_generation" was intended to be bumped whenever a takeover happens, but we haven't implemented that.

I'm not sure it's worth implementing a case we can't have in production and so can't test. What would we do if the OCC update fails for some reason?

sunshowers commented 3 months ago

Grepped through omicron for "2416" and didn't see any other results, so I think this is done for real now.

davepacheco commented 3 months ago

Related: oxidecomputer/steno#302. I hit this because in my PR had I broken things so that this retry loop continued hitting a permanent error:

00:19:42.794Z ERRO c39c0c31-e34e-4b77-9616-2a0bf956f9b6 (ServerContext): client error while updating saga state (likely requires operator intervention), retrying anyway
    call_count = 12
    error = Invalid Request: failed to update saga SagaId(9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2) with state Done: preconditions not met: expected current_sec = c39c0c31-e34e-4b77-9616-2a0bf956f9b6, adopt_generation = Generation(Generation(1)), but found current_sec = Some(c39c0c31-e34e-4b77-9616-2a0bf956f9b6), adopt_generation = Generation(Generation(2)), state = SagaCachedState(Running)
    file = nexus/db-queries/src/db/sec_store.rs:147
    new_state = done
    saga_id = 9879e3fd-ae9b-4d3d-b960-2b8c3227f3b2
    total_duration = 612.261975772s

During that time, I couldn't list sagas. My guess is that other sagas couldn't complete, either. I think Nexus is doing nothing wrong and that Steno ought to behave better here but I figured I'd mention it here so folks were aware.