oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

mgs panics when out of disk space for slog #4352

Open wesolows opened 12 months ago

wesolows commented 12 months ago

A core (2 of them, actually) from mgs was recovered from the field with this stack:

> $C
fffff5ffce1ff970 libc.so.1`_lwp_kill+0xa()
fffff5ffce1ff9a0 libc.so.1`raise+0x22(6)
fffff5ffce1ff9f0 libc.so.1`abort+0x58()
fffff5ffce1ffa00 _ZN11panic_abort18__rust_start_panic5abort17h0cfb9e9813446f8bE+9()
fffff5ffce1ffa10 __rust_start_panic+9()
fffff5ffce1ffab0 rust_panic+0x10()
fffff5ffce1ffb60 _ZN3std9panicking20rust_panic_with_hook17ha18432c291108e47E+0x2a2()
fffff5ffce1ffbb0 _ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h4645986a55c3ef2bE+0xc6()
fffff5ffce1ffbc0 _ZN3std10sys_common9backtrace26__rust_end_short_backtrace17h6107fb71e173f9a5E+9()
fffff5ffce1ffc00 rust_begin_unwind+0x71()
fffff5ffce1ffc40 _ZN4core9panicking9panic_fmt17h863016252fdb1147E+0x33()
fffff5ffce1ffcb0 _ZN51_$LT$slog..Fuse$LT$D$GT$$u20$as$u20$slog..Drain$GT$3log17hee8a52b66e11592aE+0x96()
fffff5ffce1ffe40 _ZN3std9panicking3try17h5982bf3ca9960eb9E+0x163()
fffff5ffce1ffec0 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h74a6895fe1297c3dE+0x25()
fffff5ffce1fff60 _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h37ca6e2899dc45c4E+0x9e()
fffff5ffce1fffb0 _ZN3std3sys4unix6thread6Thread3new12thread_start17h0604336de98f7b8bE+0x29()
fffff5ffce1fffe0 libc.so.1`_thrp_setup+0x77(fffff5ffee7e0240)
fffff5ffce1ffff0 libc.so.1`_lwp_start()

As Rust exists for the sole purpose of making simple things hard, the only approach I could attempt is to examine the stack for pointers to things that might be strings that would give some indication as to what went wrong. Similar to oxidecomputer/dendrite#665, I eventually found:

> 1f8d330/s
0x1f8d330:      slog::Fuse Drain: Os { code: 28, kind: StorageFull, message: "No space left on device" }

This isn't a reason to panic: logging configuration is discussed to some extent in #1014 but only in the context of whether logging should be synchronous with respect to calling code. In this case, the problem is that we're neither blocking on disk space becoming available nor dropping the record. Panicking will eventually put the service into the maintenance state, which cannot be recovered without manual intervention and is therefore not a scalable/sustainable approach. Dropping the record is probably best here, but another possibility is that this somehow needs to raise the attention of sled-agent (which in this case is not running in this zone) which could potentially take some action to free up storage. In this particular instance, that's not going to help: the switch zone is (for reasons?) on the ramdisk with the OS, which is quite small. We could implement a quota for the filesystem it's on, but that will only protect the rest of the system from the switch zone processes; this bug would still exist and we'd still hit it.

I did go look at slog_async to see why we panic here, and as usual in Rust I couldn't find any code that actually does anything. But we seem to get here: https://github.com/slog-rs/slog/blob/945dc8b8557b8d351926e2590a771439ab0a73b5/src/lib.rs#L1935 and this behaviour is apparently intentional (WTF?!). So I suspect this is really a configuration error by which we end up using this Fuse Drain instead of one that will handle errors in a more sensible manner. Note that this is a bit different from some of the other out-of-space panics I've seen, in that (a) we don't seem to be using slog_async here and (b) there's a somewhat more obvious reason we're panicking: because we're using this Fuse thing. Seems bad.
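To make the two behaviours concrete, here is a minimal std-only sketch of the difference between a fuse-style wrapper (any drain error becomes a panic) and one that drops the record and counts the loss. It is a model, not slog's code and not how mgs is actually configured: the `Drain` trait, `WriterDrain`, `PanicOnError`, `DropOnError` and `FullDisk` are all hypothetical names made up for illustration. In slog itself, the `Drain` trait offers `ignore_res()` next to `fuse()`, which is probably the closest existing analogue of the dropping variant.

```rust
use std::io::{self, Write};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a log drain; this is not slog's trait.
pub trait Drain {
    fn log(&self, record: &str) -> io::Result<()>;
}

/// Writes each record as one line to any `Write` sink.
pub struct WriterDrain<W: Write> {
    out: Mutex<W>,
}

impl<W: Write> WriterDrain<W> {
    pub fn new(out: W) -> Self {
        WriterDrain { out: Mutex::new(out) }
    }
}

impl<W: Write> Drain for WriterDrain<W> {
    fn log(&self, record: &str) -> io::Result<()> {
        // Recover from a poisoned lock so logging never panics by itself.
        let mut out = self.out.lock().unwrap_or_else(|e| e.into_inner());
        writeln!(out, "{record}")
    }
}

/// Models the fuse behaviour seen in the core: an error becomes a panic.
pub struct PanicOnError<D>(pub D);

impl<D: Drain> PanicOnError<D> {
    pub fn log(&self, record: &str) {
        if let Err(e) = self.0.log(record) {
            panic!("Fuse Drain: {e:?}");
        }
    }
}

/// Models the alternative: drop the record on error and count the loss.
pub struct DropOnError<D> {
    inner: D,
    dropped: AtomicU64,
}

impl<D: Drain> DropOnError<D> {
    pub fn new(inner: D) -> Self {
        DropOnError { inner, dropped: AtomicU64::new(0) }
    }

    pub fn log(&self, record: &str) {
        if self.inner.log(record).is_err() {
            self.dropped.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Number of records discarded so far because the sink failed.
    pub fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }
}

/// Test sink that fails every write with OS error 28, the code in the core.
pub struct FullDisk;

impl Write for FullDisk {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::from_raw_os_error(28))
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

With `FullDisk` as the sink, `PanicOnError` panics on the first record, while `DropOnError` keeps the process alive and exposes a counter that could be reported once space is available again. Whether a silent drop with a counter is enough, or whether something outside the zone needs to be told, is the open question from the discussion above.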

The second core is different and will be covered by a separate bug. Core is catacomb:/data/staff/core/customer-support/52/pool/ext/2081704d-aed2-4676-92ce-7f8a576d66ad/crypt/debug/core.oxz_switch.mgs.23522.1697600107.

ahl commented 12 months ago

> As Rust exists for the sole purpose of making simple things hard

In fairness, Rust also makes hard things hard.