spaceconcordia / SpacecraftSoftware

Space Concordia
Apache License 2.0
Watchcat #21

Open philip-brink opened 6 years ago

philip-brink commented 6 years ago

Goal

Set up monit to monitor the processes, and write tests to ensure that they are restarted when they crash.

Notes

Initially, just set up monit with a dummy process and ensure that it is restarted when killed. Once more of the competition issues are finished, this should be changed to take the real processes into account.
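A minimal sketch of the dummy-process half of this, assuming nothing about the eventual monit configuration: dummy_start launches a do-nothing background loop and records its PID in a pidfile (the kind of file a monit process check can be pointed at), and was_restarted polls until the pidfile holds a new, live PID. The function names, the pidfile path and the timeout are hypothetical.

```shell
#!/bin/bash
# Hypothetical pidfile location; the supervisor would watch the same path.
PIDFILE="${PIDFILE:-/tmp/dummy.pid}"

# Start a do-nothing background process and record its PID.
dummy_start() {
    ( while true; do sleep 1; done ) >/dev/null 2>&1 &
    echo "$!" > "$PIDFILE"
}

# Succeed if, within $2 seconds (default 60), the pidfile holds a live PID
# that differs from the old PID given as $1.
was_restarted() {
    local old_pid="$1" timeout="${2:-60}" waited=0 new_pid
    while [ "$waited" -lt "$timeout" ]; do
        new_pid="$(cat "$PIDFILE" 2>/dev/null)"
        if [ -n "$new_pid" ] && [ "$new_pid" != "$old_pid" ] && kill -0 "$new_pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

With monit supervising the pidfile and using dummy_start as its start program, a test could remember the current PID, kill -9 it, and then assert was_restarted "$old_pid".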

tatumalenko commented 6 years ago

Systemd Timers

Test script

While looking for a way to have systemd run a script after a prescribed delay (e.g. 30 min), with the delay surviving a crash or reboot, I used a simple test script that echoes a sentence followed by its time of execution:

#!/bin/bash
#/home/parallels/Desktop/timertest.bash
echo Timertest was launched at $(date +"%T")

Timer Units: .service + .timer

Service Unit

Create a .service unit named timertest.service in the /etc/systemd/system directory (where user-created systemd units go):

touch /etc/systemd/system/timertest.service

Add the following to the new empty file:

#/etc/systemd/system/timertest.service
[Unit]
Description=TimerTest

[Service]
ExecStart=/bin/bash /home/parallels/Desktop/timertest.bash

Timer Unit

Create a .timer unit named timertest.timer in the /etc/systemd/system directory (where user-created systemd units go):

touch /etc/systemd/system/timertest.timer

Add the following to the new empty file:

#/etc/systemd/system/timertest.timer
[Unit]
Description=Runs timertest service after delayed start

[Timer]
# Realtime timer (absolute scheduling)
# Time stamp of delayed start time for service
OnCalendar=2018-04-28 22:05:43
# Triggers the service immediately if it missed the last start time upon restart/crash
Persistent=true

# Monotonic timer (relative scheduling)
# Time to wait after booting before first run
#OnBootSec=10min
# Time between running each consecutive time
#OnUnitActiveSec=1h

Unit=timertest.service

[Install]
WantedBy=multi-user.target

Starting and Enabling Timer + Service Units

Start the timer process:

systemctl start timertest.timer

Enable the timer process to start on boot (every time OS boots automatically):

systemctl enable timertest.timer

Solution Attempt 1

Monotonic timers (relative scheduling) make it easy to specify a relative offset for the start of the service. However, the countdown only exists for the current boot. In other words, if a crash or reboot occurs, the timer is not persisted, and the service is not run on the next boot even if the countdown would have elapsed by then.

Realtime timers (absolute scheduling) require a date/time stamp of the exact moment at which the service should start. No relative offset syntax is provided, although there is plenty of syntax for recurring time points, e.g. weekly → once a week at 12:00 AM on Monday; Mon,Tue *-*-01..04 12:00:00 → the first four days of each month at 12:00 PM, but only if that day is a Monday or a Tuesday; Sat *-*-1..7 18:00:00 → the first Saturday of every month at 18:00. So we need some way to generate a time stamp from a given offset, using something like the date utility, and feed it to the OnCalendar property of the timer unit.
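As a small sketch of that last step, assuming GNU date is available: the hypothetical helper calendar_stamp_after turns a relative offset into the absolute stamp format used for OnCalendar elsewhere in this thread.

```shell
#!/bin/bash
# Print an absolute "YYYY-MM-DD HH:MM:SS" stamp that lies $1 in the future
# (for example "30 min"). Relies on GNU date's -d option.
calendar_stamp_after() {
    date -d "+$1" +"%F %H:%M:%S"
}

stamp="$(calendar_stamp_after "30 min")"
echo "OnCalendar=$stamp"
```

Where it is installed, systemd-analyze calendar "$stamp" can be used to check how systemd interprets the resulting expression.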

Transient Timer Units

Timer Units using systemd-run

One can use systemd-run to create transient timer units. That is, one can set a command to run at a specified time without having a service file. For example the following command touches a file after 30 seconds:

systemd-run --on-active=30 /bin/touch /tmp/foo

One can also specify a pre-existing service file that does not have a timer file. For example, the following starts the systemd unit named someunit.service after 12.5 hours have elapsed:

systemd-run --on-active="12h 30m" --unit someunit.service

Solution Attempt 2

Since transient timer units allow timer properties to be specified on the command line, they seem suitable for us, considering we need command substitution with the date utility.

In one terminal, open journalctl and follow the stream:

journalctl -f

In another terminal, run the systemd-run command with a 5 min delayed service start:

systemd-run --on-calendar="$(date +"%F %H:%M:%S" -d "+5 min")" /bin/bash /home/parallels/Desktop/timertest.bash --persistant=true

You'll get something like this in the same terminal where you launched the command:

Running timer as unit run-rb0a2feec078a4a8098a9a5f7020a653e.timer.
Will run service as unit run-rb0a2feec078a4a8098a9a5f7020a653e.service.

In the journalctl window, you'll see this:

Apr 28 23:46:40 ubuntu systemd[1]: Started /bin/bash /home/parallels/Desktop/timertest.bash --persistant=true.

You can check the transient timer that was created by listing all currently active timers:

systemctl list-timers

Which should give you something like this:

NEXT                        LEFT            LAST                        PASSED          UNIT
Sat 2018-04-28 23:51:40 EDT 5min left       n/a                         n/a             run-rb0a2feec078a4a8098a9a5
n/a                         n/a             Sat 2018-04-28 22:05:48 EDT 1h 41min ago    timertest.timer
Sun 2018-04-29 01:16:45 EDT 1h 29min left   Sat 2018-04-28 18:21:49 EDT 5h 25min ago    snapd.refresh.timer

The journalctl feed should look like this once the delay has passed:

Apr 28 23:51:40 ubuntu systemd[1]: Started /bin/bash /home/parallels/Desktop/timertest.bash --persistant=true.
Apr 28 23:51:40 ubuntu bash[4512]: Timertest was launched at 23:51:40

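Unfortunately, after a simple sudo reboot the transient timer unit is gone from the list-timers output, and journalctl shows that nothing was echoed. Note that --persistant=true, as written above, was placed after the script path, so it was passed as an argument to timertest.bash (the journal lines show it as part of the started command) rather than setting the timer's Persistent property; systemd-run takes timer settings through its --timer-property= option instead. Either way, transient units are created under /run, which is cleared at boot, so the transient approach does not allow for automatic recovery after a system shutdown/crash. A service and timer unit file are needed after all.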
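For reference, a sketch of how a timer setting can be handed to systemd-run itself: the --timer-property= option goes before the command to run. The hypothetical helper build_transient_cmd below only assembles and prints such a command so it can be inspected first. Whether a particular systemd version accepts Persistent= for a transient timer was not tested in this thread, and it would not change the reboot result described above.

```shell
#!/bin/bash
# build_transient_cmd OFFSET COMMAND...
# Print (do not execute) a systemd-run invocation that schedules COMMAND
# to run OFFSET from now, e.g. "5 min". Relies on GNU date's -d option.
build_transient_cmd() {
    local offset="$1" stamp
    shift
    stamp="$(date -d "+$offset" +"%F %H:%M:%S")"
    printf 'systemd-run --on-calendar="%s" --timer-property=Persistent=true %s\n' "$stamp" "$*"
}

build_transient_cmd "5 min" /bin/bash /home/parallels/Desktop/timertest.bash
```

The printed line can be pasted into a root shell once it looks right.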

The problem is that, to use the OnCalendar property (realtime timer) as a delayed start, its value has to be set dynamically to an offset of some amount of time from the moment the timer unit is started. From what I could tell (from the man pages and from trying every way I could think of to express a command expansion), unit files don't allow command expansions, so the date command with an offset cannot be used inside them. So the best way to get the desired absolute time into the timer unit file is a bash script that rewrites the timertest.timer file each time it is launched, with a freshly generated OnCalendar date/time stamp. A bit of a hack job, but easy to implement and versatile.

Solution Attempt 3

Start by creating a sudo session to not have to repeatedly enter password:

sudo -i

Create the execution script to dynamically generate the delayed start time stamp inside the .timer unit file:

#!/bin/bash
#/home/parallels/Desktop/tmp.bash
#e.g: bash /home/parallels/Desktop/tmp.bash timertest.timer
# Full path of the timer unit file to (re)generate
unit_file="/etc/systemd/system/$1"

# Truncate the file, then append each line; the quotes keep the shell
# from treating [Unit] etc. as glob patterns
: > "$unit_file"
echo "[Unit]" >> "$unit_file"
echo "Description=Runs timertest script after delayed start" >> "$unit_file"

echo "[Timer]" >> "$unit_file"
# Realtime timer
# Use the date util with a fixed +5 min offset to generate the absolute date/time stamp required for the OnCalendar property value
echo "OnCalendar=$(date +"%F %H:%M:%S" -d "+5 min")" >> "$unit_file"
# Triggers the service immediately if it missed the last start time
echo "Persistent=true" >> "$unit_file"
#AccuracySec=1
echo "Unit=timertest.service" >> "$unit_file"

echo "[Install]" >> "$unit_file"
echo "WantedBy=multi-user.target" >> "$unit_file"

# Need to reload daemon and restart unit whenever file is modified
systemctl daemon-reload
systemctl restart "$1"
#systemctl start "$1"
#systemctl enable "$1"
#systemctl list-timers

Run the above script using:

bash /home/parallels/Desktop/tmp.bash timertest.timer

Observe output in journalctl:

Apr 29 00:37:12 ubuntu systemd[1]: Reloading.
Apr 29 00:37:13 ubuntu systemd[1]: Stopped Runs timertest script after delayed start.
Apr 29 00:37:13 ubuntu systemd[1]: Stopping Runs timertest script after delayed start.
Apr 29 00:37:13 ubuntu systemd[1]: Started Runs timertest script after delayed start.
Apr 29 00:37:13 ubuntu systemd[1]: Started CUPS Scheduler.
Apr 29 00:37:13 ubuntu systemd[1]: Started ACPI event daemon.

Observe list-timers output:

NEXT                        LEFT       LAST                        PASSED        UNIT
Sat 2018-04-29 00:42:13 EDT 5min left  Sat 2018-04-28 22:05:48 EDT 2h 33min ago  timertest.timer
...

Once time elapsed, observe journalctl contents:

Apr 29 00:42:13 ubuntu systemd[1]: Started TimerTest.
Apr 29 00:42:13 ubuntu bash[14681]: timertest was launched at 00:42:13

If you run bash /home/parallels/Desktop/tmp.bash timertest.timer and then perform a sudo reboot, systemctl list-timers after the reboot shows the timer unit intact and counting down as though the system had never shut down or been otherwise disrupted. If the scheduled start time has already passed by the time the system is back up, the service is started immediately.
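The generator script above hard-codes the 5 minute offset and the service name. A slightly more general sketch, assuming the same unit layout, takes both as arguments and writes the file with a single here-document. write_timer_unit and its UNIT_DIR override are hypothetical names; the override exists only so the function can be tried without touching /etc/systemd/system.

```shell
#!/bin/bash
# Directory that receives the unit file; override it for a dry run.
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"

# write_timer_unit NAME OFFSET
# Writes NAME.timer so that NAME.service is due OFFSET (e.g. "5 min") from now.
# Relies on GNU date's -d option.
write_timer_unit() {
    local name="$1" offset="$2" stamp
    stamp="$(date -d "+$offset" +"%F %H:%M:%S")" || return 1
    cat > "$UNIT_DIR/$name.timer" <<EOF
[Unit]
Description=Runs $name service after delayed start

[Timer]
OnCalendar=$stamp
Persistent=true
Unit=$name.service

[Install]
WantedBy=multi-user.target
EOF
}
```

As root, something like write_timer_unit timertest "30 min" followed by systemctl daemon-reload and systemctl restart timertest.timer would then take the place of the script above.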

Useful commands

Starting and Stopping Services

sudo systemctl start application.service
sudo systemctl stop application.service

Restarting and Reloading

sudo systemctl restart application.service
sudo systemctl reload application.service
sudo systemctl reload-or-restart application.service

Enabling and Disabling Services

sudo systemctl enable application.service
sudo systemctl disable application.service

Checking the Status of Services

systemctl status application.service
systemctl is-active application.service
systemctl is-enabled application.service
systemctl is-failed application.service
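
The three one-word queries above combine naturally into a single status line, which is handy when checking a timer after a reboot. A small sketch; unit_summary is a hypothetical helper name, and "unknown" is just a placeholder for a query that printed nothing.

```shell
#!/bin/bash
# Print "<unit>: active=<state> enabled=<state> failed=<state>" on one line.
# Each systemctl query prints a single state word; exit statuses are ignored.
unit_summary() {
    local unit="$1" active enabled failed
    active="$(systemctl is-active "$unit" 2>/dev/null)"
    enabled="$(systemctl is-enabled "$unit" 2>/dev/null)"
    failed="$(systemctl is-failed "$unit" 2>/dev/null)"
    echo "$unit: active=${active:-unknown} enabled=${enabled:-unknown} failed=${failed:-unknown}"
}
```

For example, unit_summary timertest.timer prints the state of the timer unit from this thread.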

Listing Current Units

systemctl list-units --type=service

Displaying a Unit File

systemctl cat atd.service

Displaying Dependencies

systemctl list-dependencies sshd.service

Checking Unit Properties

systemctl show sshd.service

Masking and Unmasking Units

sudo systemctl mask nginx.service
sudo systemctl unmask nginx.service

Editing Unit Files

sudo systemctl edit nginx.service
sudo systemctl edit --full nginx.service

sudo rm -r /etc/systemd/system/nginx.service.d
sudo rm /etc/systemd/system/nginx.service

Reload the systemd process

sudo systemctl daemon-reload

To halt the system, you can use the halt command:

sudo systemctl halt

To initiate a full shutdown, you can use the poweroff command:

sudo systemctl poweroff

A restart can be started with the reboot command:

sudo systemctl reboot

These all alert logged-in users that the event is occurring, something that simply running or isolating the target will not do. Note that most machines link the shorter, more conventional commands for these operations so that they work properly with systemd.

For example, to reboot the system, you can usually type:

sudo reboot