Each OVS node runs a heartbeat service. Every 10 seconds, the service tries to take a polling lock in Arakoon.
(see google doc for if/else layout below)
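The polling behaviour described above can be sketched roughly as follows. This is only an illustration: `try_acquire` and `run_check` are hypothetical callables standing in for the Arakoon lock attempt and the node check, since the actual locking API is not shown here. Only the 10-second interval comes from the text.

```python
import time
from typing import Callable

POLL_INTERVAL_SECONDS = 10  # from the text: the service tries every 10 seconds


def heartbeat_loop(try_acquire: Callable[[], bool],
                   run_check: Callable[[], None],
                   iterations: int,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Try to take the lock once per interval; run the check only on success.

    try_acquire and run_check are hypothetical stand-ins, not OVS APIs.
    Returns how many times the check was executed.
    """
    executed = 0
    for _ in range(iterations):
        if try_acquire():
            run_check()
            executed += 1
        # If the lock was not taken, another node is doing the check.
        sleep(POLL_INTERVAL_SECONDS)
    return executed
```

Passing `sleep` as a parameter keeps the loop easy to exercise without waiting; a real service would presumably loop indefinitely rather than for a fixed number of iterations.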
The controller-heartbeat, a small systemd service, executes the following:
Take the controller-heartbeat lock
If it can take the lock
    Execute GET /alba/nodes/ through the API to get the local summary of each ASD node (only real ASD nodes; filter out the dual-controller nodes and the ASD nodes that are part of the global backend).
    client.get('/alba/nodes/', {'contents' : 'local_summary,type'})
    Loop over each albanode
        If nr of OSDs in error > nr of OSDs in (OK, warning)
            Take the failover lock
            If it can take the failover lock
                Release the controller-heartbeat lock (so another node can check other nodes)
                Check if ASD manager B is online by executing a GET to the root of the ASD API.
                In case there is no second ASD manager defined, release the failover lock.
                If the response status code != 200, repeat the check every 5 seconds, 3 times. If it continues to fail, release the failover lock.
                If the response status code = 200, kill the controller with the failing disks through the IPMI extension: ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis power off
                Check every 5 seconds through the IPMI extension (ipmitool -I lanplus -H <IPMIP> -U <username> -P <password> chassis status); if (System Power != off) after 30 seconds, release the failover lock.
                For each OSD, set the node id to empty. If it cannot be set to empty, leave the old node id.
                For each OSD, call OSD_move on the API of the passive ASD manager. The API is called OSD per OSD (serially).
                If all OSDs are moved, release the failover lock
            Else do nothing, as someone else is already doing the failover
        Else nothing to do, as the OSDs in error do not outnumber those in (OK, warning)
Else nothing to do, as another node is doing the check.
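A few of the building blocks in the flow above can be sketched in Python. This is a minimal sketch under stated assumptions, not the actual implementation: the shape of the per-node counts (a dict keyed by `error`, `ok` and `warning`) is assumed, since the layout of `local_summary` is not shown here, and the helper names `needs_failover`, `poll`, `ipmi_args` and `is_powered_off` are hypothetical. The ipmitool arguments are the ones quoted in the text.

```python
import time
from typing import Callable, Dict, List


def needs_failover(osd_counts: Dict[str, int]) -> bool:
    """True when OSDs in error outnumber those in OK or warning state.

    The dict layout is an assumption made for this sketch.
    """
    healthy = osd_counts.get("ok", 0) + osd_counts.get("warning", 0)
    return osd_counts.get("error", 0) > healthy


def poll(check: Callable[[], bool], attempts: int, interval: float,
         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check up to `attempts` times, sleeping `interval` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


def ipmi_args(host: str, user: str, password: str, *command: str) -> List[str]:
    """Build the ipmitool argument list quoted in the text."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *command]


def is_powered_off(chassis_status_output: str) -> bool:
    """Look for the 'System Power' line in `chassis status` output."""
    for line in chassis_status_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "System Power":
            return value.strip().lower() == "off"
    # No 'System Power' line found: treat the controller as not confirmed off.
    return False
```

With these helpers, the ASD manager check could be expressed as `poll(manager_b_online, attempts=3, interval=5)` and the power-off wait as a `poll` around `is_powered_off` with a 5-second interval, where `manager_b_online` is again a hypothetical callable. Whether "3 times" includes the first attempt is not specified in the text, so the attempt counts are a guess.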
OSD_move
The OSD_move function is a new API call on the ASD manager; its input is osd_id. It executes the following actions:
Check if the node id is empty. If it is not empty, stop with the error "OSD still owned by ASD Manager"; else
https://docs.google.com/document/d/1Jzptv2gkq7xbnStq9r93fHsT8Yfw8qo4_1hYyzfilqc