Open faithanalog opened 1 week ago
See:
Unless you're reporting this as a regression, I think this is currently expected behavior. At the moment, we destroy all transient zone filesystems when the sled reboots.
There are related issues to make the set of datasets less "implicit", and more "managed by Nexus". Of these, I'd say:
Are probably most relevant.
In particular:
This is all good background. I figured it was expected but didn't know anything about the mechanism.
The thing that bothers me about the behavior is mainly the loss of diagnostic data in the log files. My hope is that we could archive the logs from the zone filesystem somewhere before destroying the dataset (though how would we manage the lifecycle of those logs after we do this?).
The Zone Bundler in sled-agent/src/zone_bundle.rs exists, and was created to take snapshots of unexpectedly dying zones. This may be a spot where we could re-use it.
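To make the "archive the logs before destroying the dataset" idea concrete, here is a minimal sketch using only the Rust standard library. It is not the Zone Bundler's actual API and not how sled-agent does anything today. The function name `archive_zone_logs`, both directory parameters, and the "file name contains `.log`" filter are all assumptions made up for illustration. It also says nothing about the lifecycle question raised above (who cleans up the archive later).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: copy every regular file whose name contains ".log"
/// from `log_dir` (a directory inside the zone filesystem) into
/// `archive_dir` (somewhere that survives dataset destruction).
/// Returns the sorted list of destination paths that were written.
fn archive_zone_logs(log_dir: &Path, archive_dir: &Path) -> io::Result<Vec<PathBuf>> {
    // Make sure the destination exists before copying anything.
    fs::create_dir_all(archive_dir)?;

    let mut archived = Vec::new();
    for entry in fs::read_dir(log_dir)? {
        let entry = entry?;
        let path = entry.path();

        // Assumption: log files (including rotated ones such as "foo.log.0")
        // can be recognized by ".log" appearing in the file name.
        let is_log = path
            .file_name()
            .and_then(|name| name.to_str())
            .map(|name| name.contains(".log"))
            .unwrap_or(false);

        if path.is_file() && is_log {
            let dest = archive_dir.join(entry.file_name());
            fs::copy(&path, &dest)?;
            archived.push(dest);
        }
    }

    // read_dir order is unspecified, so sort for a stable result.
    archived.sort();
    Ok(archived)
}
```

The caller would run something like this before the dataset is destroyed, and only proceed with the destroy if it returns `Ok`. In practice, re-using the existing bundling code is probably preferable to a one-off copy like this.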
While investigating an apparent crucible bug, we ended up with a sled crashed into kmdb. We had retrieved some propolis logs from that sled, but not all the ones we wanted. Upon rebooting the sled, all of the propolis datasets appeared to have been deleted, and the propolis logs we wanted along with them.
The same zpools were present, so I do think that this was some system cleaning up after VMs that no longer existed after the sled crashed, but I'm not sure what exactly might have done this.