openflighthpc / concertim-ansible-playbook

Ansible playbook for building a Concertim appliance
Eclipse Public License 2.0

Add monit, scripts and config #13

Closed benarmston closed 1 year ago

benarmston commented 1 year ago

Add monit to monitor our daemons. The monit configurations have largely been taken from concertim without change. We monitor that certain processes exist; that models are published over DRb connections; and that log files are getting written to.

Some daemons have also been given resource constraints, which may or may not be set to appropriate values.

Utility scripts to help with the monitoring and clobbering of processes have also been added.

monit is not started until MIA is fully configured. If we wish to build vanilla appliances ahead of their configuration, this may need rethinking.

TODO:

benarmston commented 1 year ago

Adding monit support for emma and mia is more difficult than for the other rails servers / daemons. Both mia and emma run multiple instances of their servers, and we wish to monitor and potentially restart them individually. For instance, if the emma instance on port 9900 is consuming too much CPU or RAM, we want to restart only that instance of emma.
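A minimal sketch of what a per-instance monit check might look like, assuming each emma instance could be started and stopped on its own. The `emma-<port>.service` unit name, the thresholds, and the process-matching pattern are illustrative assumptions, not taken from this PR. The shell function only prints the stanza; it does not touch monit or systemd.

```shell
#!/usr/bin/env bash
# Print a monit stanza for a single thin instance listening on the given port.
# ASSUMPTIONS: the emma-<port>.service unit name and the CPU/memory thresholds
# are hypothetical placeholders. The "." characters in the pattern stand in for
# the literal parentheses in the process title, to avoid escaping.
render_monit_check() {
    local port="$1"
    cat <<EOF
check process emma_${port} matching "thin server .127.0.0.1:${port}."
  start program = "/usr/bin/systemctl start emma-${port}.service"
  stop program  = "/usr/bin/systemctl stop emma-${port}.service"
  if cpu > 80% for 5 cycles then restart
  if totalmem > 512 MB for 5 cycles then restart
EOF
}

# Example (not run here): one stanza per instance, written to a scratch file.
# for port in 9900 9901 9902; do render_monit_check "$port"; done > /tmp/emma.monitrc
```

Keeping one `check process` block per port means a resource breach on 9900 would trigger a restart of that instance only.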

However, our current systemd unit files are auto-generated from the /etc/init.d/ files and don't support starting and restarting instances individually. We could have monit restart the services directly instead of via systemctl, but doing so causes systemd to "lose track" of the services. For instance, in the output below, monit has restarted the emma 9900 instance. The output of pgrep shows that it is running (PID 2657962), but the output of systemctl status emma shows that systemd has "lost track" of it: that PID is missing from the service's CGroup.

root@command1:/etc/monit/conf.d# pgrep -a thin
2657572 thin server (127.0.0.1:9901)
2657583 thin server (127.0.0.1:9902)
2657962 thin server (127.0.0.1:9900)
root@command1:/etc/monit/conf.d# 
root@command1:/etc/monit/conf.d# 
root@command1:/etc/monit/conf.d# systemctl status emma
● emma.service - LSB: EMMA - the enhanced monitoring and management architecture
     Loaded: loaded (/etc/init.d/emma; generated)
     Active: active (running) since Fri 2022-10-14 16:52:25 UTC; 2 days ago
       Docs: man:systemd-sysv-generator(8)
    Process: 2657546 ExecStart=/etc/init.d/emma start (code=exited, status=0/SUCCESS)
      Tasks: 67 (limit: 3530)
     Memory: 342.8M
        CPU: 25min 31.166s
     CGroup: /system.slice/emma.service
             ├─2657572 "thin server (127.0.0.1:9901)" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
             └─2657583 "thin server (127.0.0.1:9902)" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">

Oct 14 16:52:22 command1 systemd[1]: Starting LSB: EMMA - the enhanced monitoring and management architecture...
Oct 14 16:52:22 command1 emma[2657546]:  * Starting Phoenix EMMA
Oct 14 16:52:23 command1 emma[2657551]: Starting server on 127.0.0.1:9900 ...
Oct 14 16:52:23 command1 emma[2657551]: Starting server on 127.0.0.1:9901 ...
Oct 14 16:52:24 command1 emma[2657551]: Starting server on 127.0.0.1:9902 ...
Oct 14 16:52:25 command1 emma[2657546]:    ...done.
Oct 14 16:52:25 command1 systemd[1]: Started LSB: EMMA - the enhanced monitoring and management architecture.
root@command1:/etc/monit/conf.d# 

I'm not sure what the consequences of this are. systemctl restart emma may or may not restart the correct services reliably. This may depend on the exact implementation of /etc/init.d/emma. I am sure that this is the wrong way to use systemd unit files.
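The mismatch between pgrep and the service's CGroup can be checked mechanically. The following is a rough sketch, not something from this PR: `untracked_pids` is a hypothetical helper that takes two whitespace-separated PID lists and prints those present in the first but absent from the second. How the two lists are gathered on a live host is left as a comment, and the cgroup path shown there assumes a cgroup v2 layout.

```shell
#!/usr/bin/env bash
# Print PIDs that are running but not tracked by systemd.
# $1: whitespace-separated PIDs of running processes
# $2: whitespace-separated PIDs listed in the service's cgroup
untracked_pids() {
    local running="$1" tracked="$2"
    # Normalise each list to one PID per line, sorted, then keep the
    # lines that appear only in the first list.
    comm -23 \
        <(tr -s ' \n' '\n' <<<"$running" | sed '/^$/d' | sort -u) \
        <(tr -s ' \n' '\n' <<<"$tracked" | sed '/^$/d' | sort -u)
}

# On a live host the inputs might be gathered like this (assumed paths):
#   running=$(pgrep thin)
#   tracked=$(cat /sys/fs/cgroup/system.slice/emma.service/cgroup.procs)
#   untracked_pids "$running" "$tracked"
```

Fed the PIDs from the output above, the helper reports 2657962, the instance that monit restarted outside of systemd.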

A potential solution would be to have individual unit files for each emma service and either make them PartOf an "emma group" service or WantedBy an "emma group" target. This solution will also work for mia and is probably desirable to have for mongrel_rails (aka the phoenix modules) too.
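As a hedged illustration of that idea, the sketch below prints one unit file per emma instance, each tied to a grouping target through `PartOf=` and `WantedBy=`. The unit and target names (`emma-<port>.service`, `emma.target`) and the `ExecStart` line are assumptions for illustration only; the real command would have to come from /etc/init.d/emma.

```shell
#!/usr/bin/env bash
# Print a systemd unit for one emma instance on the given port.
# ASSUMPTIONS: the emma.target name and the ExecStart command are placeholders;
# the actual thin invocation used by /etc/init.d/emma is not shown in this PR.
render_emma_unit() {
    local port="$1"
    cat <<EOF
[Unit]
Description=EMMA thin server on port ${port}
# Stopping or restarting emma.target propagates to this unit.
PartOf=emma.target

[Service]
Type=simple
ExecStart=/usr/bin/env thin start --address 127.0.0.1 --port ${port}
Restart=on-failure

[Install]
# Enabling this unit makes emma.target pull it in on start.
WantedBy=emma.target
EOF
}

# Print the grouping target that the per-instance units attach to.
render_emma_target() {
    cat <<EOF
[Unit]
Description=All EMMA thin instances

[Install]
WantedBy=multi-user.target
EOF
}

# Example (not run here): write the files to a scratch directory for review.
# for port in 9900 9901 9902; do
#     render_emma_unit "$port" > "/tmp/emma-units/emma-${port}.service"
# done
# render_emma_target > /tmp/emma-units/emma.target
```

With units along these lines, monit could restart a single instance through systemctl so that systemd keeps tracking it, while stopping or restarting the target would still act on the whole group.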

I think implementing that solution is outside the scope of this initial pass at getting monit running. The services most likely to fail are being monitored.