lsst-uk / somerville-operations

User issue reporting and tracking for the Somerville Cloud

Write auto-shutdown script for Somerville #68

Open astrodb opened 1 year ago

astrodb commented 1 year ago

Develop a script or Ansible playbook to automate the shutdown of the Somerville Cloud and Ceph storage, for use in the event of power outages or HVAC failure.

Investigate how to automate triggering of the script using ACF resources for those events.

GregBlow commented 1 year ago

Aim to report some progress by Jan 31st.

GregBlow commented 1 year ago

The first step is to get useful information out of the UPS. As cabinet 4 is presently empty, I will enquire about the possibility of using it as a test subject.

GregBlow commented 1 year ago

Likely not useful to pull power to a single rack. The current concept is to use a cron job on the admin node to poll the (plant room B) UPS interface for connection status.
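
A minimal sketch of that cron-driven check, assuming the UPS status can be read via NUT's upsc (the UPS name, host and script paths below are illustrative, not what is actually deployed):

#!/bin/bash
## Cron-driven check: poll the plant room B UPS and trigger the shutdown
## script if it reports running on battery.
UPS="plantroomb-ups@ups-monitor.example"   ## hypothetical NUT identifier
STATUS=$(upsc "$UPS" ups.status 2>/dev/null)

## NUT status tokens: "OL" = on line power, "OB" = on battery
case "$STATUS" in
  *OB*)
    logger -t autoshutdown "UPS reports on-battery; starting shutdown"
    /home/stack/gblow/autoshutdown/main.sh   ## placeholder entry point
    ;;
  *)
    : ## mains power OK, nothing to do
    ;;
esac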

GregBlow commented 1 year ago

Testing poses some problems. The signal that the system is running on UPS must come from a UPS that is serving production equipment, which will therefore not be available for putting into test conditions. We may need to divide the testing into testing the action on power failure and testing detection of power failure, address each separately, and join the two together as cleanly as possible.

astrodb commented 1 year ago

Major UPS works are scheduled for the ACF. There will be at least 2 shutdowns of our system for 1-2 hours each. It would be good to produce a shutdown script prior to this, and test it before the first shutdown. We can then focus on the trigger mechanism after the new UPS is in place.

GregBlow commented 1 year ago

Automation should perform all of the tasks described in this SOP:

https://www.wiki.ed.ac.uk/pages/viewpage.action?pageId=528172060

There are likely Ansible libraries that can do this effectively, but given the lack of test and development space it is probably better to run it as a shell script using proven commands, at least in the first instance.

GregBlow commented 1 year ago

Writing a test of the Ceph system shutdown at sv-admin-0:/home/stack/gblow/autoshutdown, using the backup Ceph system as a testbed.

GregBlow commented 1 year ago

The Ansible modules found so far are written for deployment rather than management, so I'm working with shell scripts for now.

astrodb commented 1 year ago

I have scripts for the machines here at ROE if you want them for reference. Shutting down the OpenStack side of things will be more complicated though, and it might be worth checking with StackHPC to see if they have any similar scripts/tools used at other sites.

GregBlow commented 1 year ago

I think it should be OK; I'm working through it at a reasonable pace. Having discarded Ansible as the mechanism, the writing is faster (though the result will likely be less robust).

I'll script a first draft of the complete process and ask StackHPC to come back re: improvements afterwards.

astrodb commented 1 year ago

I think you'll want to use Ansible for the shutdowns, with Bash as the control. What I have at ROE is a shell script on a cron job that runs every 5 minutes. If it detects a heat warning (sent to the server as an email from a temperature sensor), it instigates a central shutdown shell script. The detection shell script looks like:

#!/bin/sh
## Script run through cron to check for temp alert emails
if grep -Fxq "TEMPALERT" /var/mail/heat
then
   mv /var/mail/heat /home/heat/alert.mail
   /root/ansible/bin/heat-shutdown.sh
fi
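
For reference, the crontab entry for that check would be along these lines (the script path here is illustrative):

## Run the heat-alert check every 5 minutes
*/5 * * * * /root/ansible/bin/check-heat.sh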

The shutdown control script looks like:

#!/bin/bash
## Script for emergency shutdown of all servers due to heat situation in C2
#
## Shutdown test server to see if script is working
#/root/ansible/bin/test-shutdown

## Shutdown all virtual servers first, so physical hosts can shutdown quickly
/root/ansible/bin/vm-shutdown

## Shutdown Euclid virtual cluster using ansible script on sdc-uk
/root/ansible/bin/sdc-uk-shutdown

## Shutdown euclid services
/root/ansible/bin/euclid-services-shutdown

And then those individual playbooks contain the actual shutdown commands and actions. That is especially useful for the Ceph cluster, which needs to have the pause/no-out/no-rebalance/no-recovery commands run and completed before the systems get ordered to power down.
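
For context, the Ceph flags referred to above are normally set with commands along these lines before any Ceph hosts are powered down (and unset in reverse order on the way back up):

## Quiesce the Ceph cluster ahead of powering down its hosts
ceph osd set noout        ## don't mark down OSDs as out of the cluster
ceph osd set norebalance  ## don't start rebalancing data
ceph osd set norecover    ## don't start recovery I/O
ceph osd set pause        ## pause client I/O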

GregBlow commented 1 year ago

Most of the shutdown commands are sent through kayobe (all of them except for sv-seed-vm-0 and sv-admin-0, which come at the end of the process). I could send these commands through Ansible via the ansible.builtin.command module (https://docs.ansible.com/ansible/latest/collections/ansible/builtin/command_module.html), but I'm not sure that would be advantageous.
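
For illustration, an ad-hoc invocation of the command module would look roughly like this (inventory path and group name are made up):

## Ad-hoc Ansible run of a shutdown command against a host group, with privilege escalation
ansible -i inventory/hosts compute -b -m ansible.builtin.command -a "shutdown -h +1"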

As for regulating the rate, I was thinking of using a while true loop on e.g. ceph -s to check for flags.
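
As a sketch, that wait loop could look like this (the grep patterns would need matching to the flag reporting of our Ceph release):

## Block until the cluster status reports the shutdown flags as set
while true; do
    status=$(ceph -s)
    echo "$status" | grep -q noout && \
        echo "$status" | grep -q norebalance && \
        echo "$status" | grep -q norecover && break
    sleep 5
done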

GregBlow commented 1 year ago

The script is written as individual components that can be called sequentially from a main script, but due to differences between the testing environment and the live system I think it would be sensible to run each component individually at the scheduled maintenance and verify correct behaviour.

Presently the script is written with modifications to prevent accidental running (commented-out commands, and display commands rather than actuation), which will need to be substituted; these have been designed to be quick to modify. An alternative, once the live version is verified, will be to delete the local clone of the code repository, but it would be best to combine this with the testing phase.
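
One common pattern for that kind of safety switch (not necessarily what the script currently does) is a dry-run wrapper, so arming the script is a one-line change rather than editing many commented-out commands:

#!/bin/bash
## DRY_RUN=1 only prints each action; set DRY_RUN=0 to actually execute it.
DRY_RUN=1

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

## Example usage (host name illustrative):
run ssh sv-example-node "sudo shutdown -h now"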

GregBlow commented 7 months ago

To All,

The ACF will be performing major work on the UPS system in the room where Somerville is housed, and they have warned us there will be a 4 day downtime in March/April. Let me know if there are any dates that are critically important for Somerville to be online for your project (training/public demos/etc), and I’ll pass those along to try and influence the final schedule. Once I know the dates I’ll pass them along.

Cheers,

Mark

Good test opportunity.

Also:

Write a per-machine implementation (as opposed to centrally controlled). Consider how this affects clustered systems (e.g. the Ceph system: does the controller need to issue commands and receive confirmation of their completion before individual nodes power down?). Test on the new hardware while it is not yet integrated.
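
A rough sketch of what a per-machine variant could look like, assuming each node checks the UPS status itself and Ceph hosts wait for the cluster flags before powering off (the UPS identifier and other details are illustrative):

#!/bin/bash
## Per-node auto-shutdown, run locally on each host from cron or a systemd timer.
UPS="plantroomb-ups@ups-monitor.example"   ## hypothetical NUT identifier

## Exit quietly unless the UPS reports on-battery ("OB")
upsc "$UPS" ups.status 2>/dev/null | grep -q OB || exit 0

## On Ceph hosts, wait until the controller has set the cluster-wide flags
if command -v ceph >/dev/null 2>&1; then
    until ceph -s | grep -q noout; do sleep 5; done
fi

sudo shutdown -h +1 "UPS on battery: automatic node shutdown"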

GregBlow commented 6 months ago

Testing on 22nd April as a prelude to the power works (taking advantage of the necessary full maintenance).