lsst-uk / somerville-operations

User issue reporting and tracking for the Somerville Cloud

Post-mortem of RTS on 26th April 2024, following on from ACF power works in Computer Room #1 #166

markgbeckett closed 1 week ago

markgbeckett commented 2 months ago

The Somerville return to service, following the ACF power work during 22nd--26th April, did not go as smoothly as planned. We should post-mortem the experience to help us prepare better for next time.

High level things of note:

astrodb commented 2 months ago

Quick notes:

  1. Not sure what we can do about the CDU issue other than note and report it.
  2. Discussed with Greg, and should be fixed. Previous changes were done by Daniel as part of training, but not saved properly and hence were lost on the reboot. Should not happen again.
  3. This is a known issue and the immediate documentation has been updated for next time. Longer term we need to sort out the failed tests with StackHPC and establish if those are expected failures or not (which is a label in Tempest).
  4. Our restart plan needs to be adjusted. It appears that a full restart must then be followed by a rolling reboot of the controller nodes to bring some things back online. It's not clear why this happens, but it should be investigated.
  5. This is likely another config/permissions change that wasn't saved and/or reapplied correctly on the restart. Will open the old issue and investigate.
  6. This can't be helped, and was just a matter of reality to be coped with. Hopefully these power downs are an infrequent event.

GregBlow commented 1 month ago

  1. CDU issue is a stochastic failure; things fall apart.
  2. & 5. were consequences of 2.

GregBlow commented 1 month ago

1a. The sv-ssd-0-7 mgmt port failure was, similarly, a hardware failure. However, it was not a critical factor in restoring service.