As Rust exists for the sole purpose of making simple things hard, the only approach I could attempt is to examine the stack for pointers to things that might be strings that would give some indication as to what went wrong. Similar to oxidecomputer/dendrite#665, I eventually found:
```
> 1f8d330/s
0x1f8d330:      slog::Fuse Drain: Os { code: 28, kind: StorageFull, message: "No space left on device" }
```
This isn't a reason to panic: logging configuration is discussed to some extent in #1014, but only in the context of whether logging should be synchronous with respect to calling code. In this case, the problem is that we're neither blocking on disk space becoming available nor dropping the record. Panicking will eventually put the service into the maintenance state, which cannot be recovered without manual intervention and is therefore not a scalable or sustainable approach. Dropping the record is probably best here, but another possibility is that this somehow needs to raise the attention of sled-agent (which in this case is not running in this zone), which could potentially take some action to free up storage.

In this particular instance, that's not going to help: the switch zone is (for reasons?) on the ramdisk with the OS, which is quite small. We could implement a quota for the filesystem it's on, but that will only protect the rest of the system from the switch zone processes; this bug would still exist and we'd still hit it.
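The "drop the record" policy is simple to sketch. Below is a std-only toy (no slog involved; `DropOnErr` and `FullDisk` are hypothetical names, not anything from our tree) showing a drain wrapper that swallows write failures, so a full disk costs log records rather than the process:

```rust
use std::io::{self, Write};

// Hypothetical drain: writes log lines to an inner writer, and on any
// I/O error (e.g. ENOSPC) silently drops the record instead of failing.
struct DropOnErr<W: Write> {
    inner: W,
    dropped: u64, // count of records we had to discard
}

impl<W: Write> DropOnErr<W> {
    fn new(inner: W) -> Self {
        DropOnErr { inner, dropped: 0 }
    }

    // Never surfaces an error: a failed write increments a counter and
    // the record is lost, but the caller keeps running.
    fn log(&mut self, record: &str) {
        if writeln!(self.inner, "{}", record).is_err() {
            self.dropped += 1;
        }
    }
}

// A writer that simulates a disk filling up after a fixed number of bytes,
// standing in for the "No space left on device" condition from the core.
struct FullDisk {
    remaining: usize,
}

impl Write for FullDisk {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if buf.len() > self.remaining {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "No space left on device",
            ));
        }
        self.remaining -= buf.len();
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut drain = DropOnErr::new(FullDisk { remaining: 16 });
    drain.log("short"); // fits
    drain.log("this line is definitely longer than what's left"); // dropped
    println!("dropped {} records", drain.dropped);
}
```

Counting the drops matters: silently losing records is acceptable, but we'd want the loss to be observable (a counter, or a metric) so the condition can still be diagnosed after the fact.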
I did go look at `slog_async` to see why we panic here, and as usual in Rust I couldn't find any code that actually does anything. But we seem to get here: https://github.com/slog-rs/slog/blob/945dc8b8557b8d351926e2590a771439ab0a73b5/src/lib.rs#L1935 and this behaviour is apparently intentional (WTF?!). So I suspect this is really a configuration error by which we end up using this `Fuse` `Drain` instead of one that will handle errors in a more sensible manner. Note that this is a bit different from some of the other out-of-space panics I've seen, in that (a) we don't seem to be using `slog_async` here and (b) there's a somewhat more obvious reason we're panicking: because we're using this `Fuse` thing. Seems bad.
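To make the panic path concrete without pulling in the real crate, here is a minimal model of the `Drain`/`Fuse` design. This is a toy that mirrors the shape of slog's API, not slog itself; in the real crate the analogous choice is between fusing a drain (panic on error) and ignoring its result (drop the record on error):

```rust
// Toy model of slog's Drain trait and its two error policies.
// This mirrors the shape of the real API but is NOT the slog crate.

trait Drain {
    fn log(&self, record: &str) -> Result<(), String>;
}

// A drain whose underlying storage is full: every log attempt fails,
// like the Os { code: 28, kind: StorageFull } error found in the core.
struct FullDiskDrain;

impl Drain for FullDiskDrain {
    fn log(&self, _record: &str) -> Result<(), String> {
        Err("No space left on device".to_string())
    }
}

// The Fuse policy: any error from the inner drain becomes a panic.
struct Fuse<D: Drain>(D);

impl<D: Drain> Drain for Fuse<D> {
    fn log(&self, record: &str) -> Result<(), String> {
        if let Err(e) = self.0.log(record) {
            panic!("slog::Fuse Drain: {}", e); // the path that took MGS down
        }
        Ok(())
    }
}

// The ignore-errors policy: failures are discarded, records dropped.
struct IgnoreRes<D: Drain>(D);

impl<D: Drain> Drain for IgnoreRes<D> {
    fn log(&self, record: &str) -> Result<(), String> {
        let _ = self.0.log(record); // drop the record, keep running
        Ok(())
    }
}

fn main() {
    let safe = IgnoreRes(FullDiskDrain);
    safe.log("this record is silently dropped").unwrap(); // no panic

    let fused = Fuse(FullDiskDrain);
    let died = std::panic::catch_unwind(move || fused.log("boom"));
    assert!(died.is_err()); // the Fuse policy panicked, as in the core
}
```

The suspected fix, then, is at configuration time: whichever code builds the logger for this process should wrap the file drain with an ignore/drop policy rather than a fuse.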
The second core is different and will be covered by a separate bug.

Two cores from mgs were recovered from the field; the one analyzed here is `catacomb:/data/staff/core/customer-support/52/pool/ext/2081704d-aed2-4676-92ce-7f8a576d66ad/crypt/debug/core.oxz_switch.mgs.23522.1697600107`.