Post-mortem of RTS on 26th April 2024, following on from ACF power works in Computer Room #1

markgbeckett commented 2 months ago

The Somerville return to service, following the ACF power work during 22nd--26th April, did not go as smoothly as planned. We should post-mortem the experience, to help us better prepare for next time.

High level things of note:

1. CDU failed during power down and required repair (from HPE)
2. Switch failed to pick up correct configuration when powered on, leaving some Ceph storage unavailable.
3. Tempest test suite did not run smoothly, because of preexisting configuration issues.
4. Routing and permissions problems experienced by Lasair team following return to service
5. Issues experienced by Lasair team mounting CephFS
6. Recovery interrupted to allow staff to relocate from ACF at the end of the working day

astrodb commented 2 months ago

Quick notes:

1. Not sure what we can do about the CDU issue other than note and report it.
2. Discussed with Greg, and should be fixed. Previous changes were done by Daniel as part of training, but were not saved properly and hence were lost on the reboot. Should not happen again.
3. This is a known issue and the immediate documentation has been updated for next time. Longer term, we need to sort out the failed tests with StackHPC and establish whether those are expected failures or not (which is a label in Tempest).
4. Our restart plan needs to be adjusted. It appears that a full restart then requires a rolling reboot of the controller nodes to bring some things back online. It's not clear why this happens, but it should be investigated.
5. This is likely another config/permissions change which wasn't saved and/or reapplied correctly on the restart. Will open the old issue and investigate.
6. This can't be helped, and was just a matter of reality to be coped with. Hopefully these power downs are an infrequent event.

GregBlow commented 1 month ago

1. CDU issue is a stochastic failure; things fall apart.
4. & 5. were consequences of 2.

GregBlow commented 1 month ago

1a. sv-ssd-0-7 mgmt port failure was, similarly, a hardware failure. However, it was not a critical factor in restoring service.

lsst-uk / somerville-operations

Post-mortem of RTS on 26th April 2024, following on from ACF power works in Computer Room #1 #166