astrodb opened 1 year ago
Aim to return some progress by Jan 31st
First step is to get useful information out of the UPS. As cabinet 4 is presently empty, will enquire about the possibility of using it as a test subject.
It is likely not practical to cut power to a single rack. The current concept is to use a cron job on the admin node to poll the (plant room B) UPS interface for connection status.
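The poll itself could be a small shell script run from cron every few minutes. A minimal sketch, assuming the UPS is exposed through NUT's `upsc` (the UPS name `plantroom-b-ups` and the shutdown script path are placeholders, not the real configuration):

```shell
#!/bin/sh
# Hypothetical cron-driven poll of the plant room B UPS via NUT.
# Cron entry (assumption): */5 * * * * root /root/autoshutdown/poll-ups.sh
on_battery() {
    # NUT status strings contain "OB" (On Battery) when mains has failed
    case "$1" in
        *OB*) return 0 ;;
        *)    return 1 ;;
    esac
}

if on_battery "$(upsc plantroom-b-ups@localhost ups.status 2>/dev/null)"; then
    # Placeholder path for the central shutdown script
    /root/autoshutdown/main-shutdown.sh
fi
```

If `upsc` is unreachable the status string is empty, so the script errs on the side of doing nothing rather than triggering a spurious shutdown.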
Testing poses some problems. The signal that the system is running on UPS must come from a UPS that is serving production equipment, which will therefore not be available to put into test conditions. We may need to divide the work into testing the action on power failure and testing detection of power failure, address each separately, and join the two as neatly as possible.
Major UPS works are scheduled for the ACF. There will be at least 2 shutdowns of our system for 1-2 hours each. It would be good to produce a shutdown script prior to this, and test it before the first shutdown. We can then focus on the trigger mechanism after the new UPS is in place.
Automation should perform all of the tasks described in this SOP:
https://www.wiki.ed.ac.uk/pages/viewpage.action?pageId=528172060
There are likely Ansible libraries that can do this effectively, but given the lack of test and development space it is probably better to run it as a shell script using proven commands, at least in the first instance.
Writing a test of the Ceph system shutdown at sv-admin-0:/home/stack/gblow/autoshutdown, using the backup Ceph system as a testbed.
The Ansible modules found so far are written for deployment rather than management; working with shell scripts for now.
I have scripts for the machines here at ROE if you want them for reference. Shutting down the OpenStack side of things will be more complicated though, and it might be worth checking with StackHPC to see if they have any similar scripts/tools used at other sites.
I think it should be ok, working through at a reasonable pace. Having discarded Ansible as a mechanism, writing it is faster (though the result will likely be less robust).
I'll script a first draft of the complete process and ask StackHPC to come back re: improvements afterwards.
I think you'll want to use Ansible for the shutdowns, with Bash as the control. What I have at ROE is a shell script on a cron job that runs every 5 minutes. If it detects a heat warning (sent to the server as an email from a temp sensor), it instigates a central shutdown shell script. The detection shell script looks like:
```sh
#!/bin/sh
## Script run through cron to check for temp alert emails
if grep -Fxq "TEMPALERT" /var/mail/heat
then
    mv /var/mail/heat /home/heat/alert.mail
    /root/ansible/bin/heat-shutdown.sh
fi
```
The shutdown control script looks like:
```bash
#!/bin/bash
## Script for emergency shutdown of all servers due to heat situation in C2
#
## Shutdown test server to see if script is working
#/root/ansible/bin/test-shutdown
## Shutdown all virtual servers first, so physical hosts can shut down quickly
/root/ansible/bin/vm-shutdown
## Shutdown Euclid virtual cluster using ansible script on sdc-uk
/root/ansible/bin/sdc-uk-shutdown
## Shutdown euclid services
/root/ansible/bin/euclid-services-shutdown
```
And then those individual playbooks contain the actual shutdown commands and actions. That is especially useful for the Ceph cluster, which needs to have the pause/no-out/no-rebalance/no-recovery commands run and completed before the systems get ordered to power down.
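For reference, the Ceph step could be sketched as below, with an executor parameter so the same code can be dry-run outside the live system. The flag list matches the pause/no-out/no-rebalance/no-recovery commands mentioned above; the helper name is an assumption, not existing code:

```shell
#!/bin/bash
# Hypothetical helper: issue the Ceph maintenance flags that must be set
# (and confirmed) before OSD hosts are powered down.
# Pass "ceph" on the live system, or "echo ceph" for a dry run.
set_ceph_flags() {
    local run="$1"
    for flag in noout norebalance norecover pause; do
        $run osd set "$flag"
    done
}

# Dry run: prints the commands instead of executing them
set_ceph_flags "echo ceph"
```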
Most of the shutdown commands are sent through Kayobe (all of them except for sv-seed-vm-0 and sv-admin-0, which come at the end of the process). I could send these commands through Ansible using the ansible.builtin.command module (https://docs.ansible.com/ansible/latest/collections/ansible/builtin/command_module.html), but I'm not sure that would be advantageous.
As for regulating the rate, I was thinking of using a while-true loop on e.g. `ceph -s` to check for flags.
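That loop could be wrapped in a small timeout helper so the shutdown never hangs indefinitely if the flags fail to appear. A sketch (the `wait_for` helper is an assumption; on the live system the check would be something like `ceph -s | grep -q pauserd`):

```shell
#!/bin/bash
# Hypothetical helper: poll a check command until it succeeds or a timeout
# expires. On the live system the check might be
#   ceph -s | grep -q pauserd
# to confirm the pause flag is active before continuing.
wait_for() {
    local check="$1" timeout="${2:-300}" interval="${3:-10}" elapsed=0
    until eval "$check"; do
        sleep "$interval"
        elapsed=$((elapsed + interval))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "wait_for: timed out after ${timeout}s" >&2
            return 1
        fi
    done
}

# Example with a check that succeeds immediately
wait_for "true" 30 1 && echo "condition met"
```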
The script is written in individual components that can be called sequentially from a main script, but given the differences between the testing environment and the live system, I think it would be sensible to run each component individually at the scheduled maintenance and verify correct behaviour.
Presently the script is written with modifications to prevent accidental running (commented-out commands, display rather than actuation), which will need to be substituted; these have been designed to be quick to modify. An alternative, once the live version is verified, would be to delete the local clone of the code repository, but it would be best to combine this with the testing phase.
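The sequencing described above could be sketched as follows, aborting on the first failure so later stages never run against a half-stopped system (the stage paths are placeholders, not the real component names):

```shell
#!/bin/bash
# Hypothetical main script: run each shutdown component in order and stop
# at the first failure.
set -u
run_stage() {
    echo "== running: $* =="
    "$@" || { echo "stage failed: $*" >&2; exit 1; }
}

# On the live system these would be the real component scripts, e.g.
#   run_stage /home/stack/gblow/autoshutdown/10-vm-shutdown.sh
run_stage true
run_stage true
echo "all stages complete"
```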
To All,
The ACF will be performing major work on the UPS system in the room where Somerville is housed, and they have warned us there will be a 4 day downtime in March/April. Let me know if there are any dates that are critically important for Somerville to be online for your project (training/public demos/etc), and I’ll pass those along to try and influence the final schedule. Once I know the dates I’ll pass them along.
Cheers,
Mark
Good test opportunity.
Also:
Write a per-machine implementation (as opposed to centrally controlled). Consider how this affects clustered systems (e.g. the Ceph system: does the controller need to issue commands and receive confirmation of completion before individual nodes power down?). Test on new hardware while it is not yet integrated.
Testing 22nd April as a prelude to the power works (taking advantage of the necessary full maintenance).
Develop a script/ansible-playbook to automate the shutdown of the Somerville Cloud and Ceph storage, used in the event of power outages or HVAC failure.
Investigate how to automate triggering of the script using ACF resources for those events.