propolis logs lost on reboot

faithanalog commented 1 week ago

While investigating an apparent crucible bug, we ended up with a sled crashed into kmdb. We had retrieved some propolis logs from that sled, but not all the ones we wanted.

Upon rebooting the sled, all of the propolis datasets appeared to have been deleted, and the propolis logs we wanted along with them.

The same zpools were present, so I do think that this was some system cleaning up after VMs that no longer existed after the sled crashed, but I'm not sure what exactly might have done this.

smklein commented 1 week ago

See:

https://github.com/oxidecomputer/omicron/blob/8d730794ac08ac0b70fd16628ccd3057ce048a0e/sled-storage/src/dataset.rs#L65-L66

https://github.com/oxidecomputer/omicron/blob/8d730794ac08ac0b70fd16628ccd3057ce048a0e/sled-storage/src/dataset.rs#L110-L111

https://github.com/oxidecomputer/omicron/blob/8d730794ac08ac0b70fd16628ccd3057ce048a0e/sled-storage/src/dataset.rs#L285-L306

Unless you're reporting this as a regression, I think this is currently expected behavior: at the moment, we destroy all transient zone filesystems when the sled reboots.

There are related issues to make the set of datasets less "implicit", and more "managed by Nexus". Of these, I'd say:

Are probably most relevant.

In particular:

If we finish making Nexus aware of all U.2 dataset allocations, we can avoid this periodic "clear-on-reboot" behavior to garbage collect old instance filesystems.
... then we can make more significant progress on re-constructing instance state, rather than destroying them on boot.
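To make the difference between the two policies concrete, here is a minimal sketch and not the actual sled-storage code. The function name `datasets_to_destroy`, the plain string dataset names, and the idea of passing the Nexus-known set as an `Option` are all assumptions made for illustration. With no known set, everything transient is cleared (roughly today's behavior); with a known set, only datasets that nothing claims are garbage collected.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: decide which transient zone datasets to destroy on boot.
///
/// * `present` is the list of dataset names found on the sled (illustrative strings).
/// * `known` is the set of datasets the control plane says should exist, if any.
///   `None` models the current "clear-on-reboot" behavior, where the sled has
///   no such record and destroys every transient dataset.
fn datasets_to_destroy(present: &[&str], known: Option<&BTreeSet<String>>) -> Vec<String> {
    present
        .iter()
        .copied()
        .filter(|name| match known {
            // No record of what should exist: everything is treated as garbage.
            None => true,
            // With a record: only unclaimed datasets are garbage.
            Some(known) => !known.contains(*name),
        })
        .map(String::from)
        .collect()
}
```

Under the second policy, a dataset that is still claimed after a crash survives the reboot, along with whatever logs it holds.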

faithanalog commented 1 week ago

This is all good background. I figured it was expected but didn't know anything about the mechanism.

The thing that bothers me about the behavior is mainly the loss of diagnostic data in the log files. My hope is that we could archive the logs from the zone filesystem somewhere before destroying the dataset (though how would we manage the lifecycle of those logs after we do this?).
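A rough sketch of the "archive, then destroy" ordering, using only the Rust standard library. Everything here is hypothetical: the function names `archive_logs` and `archive_then_destroy`, the rule that a log is any regular file with `.log` in its name, and the `destroy` closure standing in for whatever actually removes the dataset. It does not touch the lifecycle question; archived files simply accumulate in `archive_dir`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Copy every regular file whose name contains ".log" from `log_dir` into
/// `archive_dir`, creating `archive_dir` if needed. Returns the sorted list
/// of destination paths. (The ".log" rule is an assumption for illustration.)
fn archive_logs(log_dir: &Path, archive_dir: &Path) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(archive_dir)?;
    let mut archived = Vec::new();
    for entry in fs::read_dir(log_dir)? {
        let entry = entry?;
        let path = entry.path();
        let is_log =
            path.is_file() && entry.file_name().to_string_lossy().contains(".log");
        if !is_log {
            continue;
        }
        let dest = archive_dir.join(entry.file_name());
        fs::copy(&path, &dest)?;
        archived.push(dest);
    }
    archived.sort();
    Ok(archived)
}

/// Archive logs first, and only run `destroy` if archiving succeeded, so a
/// failed copy never results in the logs being deleted anyway.
fn archive_then_destroy(
    log_dir: &Path,
    archive_dir: &Path,
    destroy: impl FnOnce() -> io::Result<()>,
) -> io::Result<Vec<PathBuf>> {
    let archived = archive_logs(log_dir, archive_dir)?;
    destroy()?;
    Ok(archived)
}
```

The design choice worth noting is the ordering: destruction is gated on a successful copy. Whether a failed archive should block the cleanup is a policy question this sketch does not settle.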

smklein commented 1 week ago

The Zone Bundler in sled-agent/src/zone_bundle.rs already exists; it was created to take snapshots of zones that die unexpectedly. This may be a spot where we could reuse it.

oxidecomputer / omicron

propolis logs lost on reboot #7012