Play apps crashed, monit silently unmonitors

dr0i commented 3 years ago

Today we discovered that the Play app of lobid-resources on quaoar1 and weywot2 could not be launched and had been silently unmonitored by monit, resulting in downtime for lobid-resources and dependent apps like nwbib.

One issue is that while the restart.sh script already removes the RUNNING_PID, the monit_restart.sh script does not. Because monit uses monit_restart.sh, Play refuses to restart the web app whenever a RUNNING_PID file is left over, which sometimes happens even after the app has crashed. So the solution is to

[x] remove the RUNNING_PID in monit_restart.sh
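
A minimal sketch of what such a cleanup step could look like. The function name remove_stale_pid and the idea of checking the PID before deleting are assumptions for illustration; the actual restart.sh may simply delete the file unconditionally.

```shell
#!/bin/sh
# Hypothetical helper (not taken from restart.sh): remove a leftover
# Play RUNNING_PID file unless the recorded process is still alive.
remove_stale_pid() {
    pid_file="$1"
    # Nothing to do if there is no PID file.
    [ -f "$pid_file" ] || return 0
    pid=$(cat "$pid_file")
    # kill -0 sends no signal, it only tests whether the process can be
    # signalled. Note: a process owned by another user also fails this test.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is still running, keeping $pid_file"
        return 1
    fi
    echo "removing stale $pid_file"
    rm -f "$pid_file"
}
```

It would be called with the path to the app's RUNNING_PID file right before Play is started, so that a crashed app no longer blocks the restart.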

It would also be nice to

[x] inform via email when monit unmonitors a process

(We don't need to be informed when monit restarts a process: almost all web apps are restarted once a month via crontab anyway, and a restart is not a problem in itself because the high-availability proxy of Apache redirects to the spare server. Getting an email for every restart would be too noisy.)

dr0i commented 3 years ago

Again discovered many defunct processes ("zombies", check with "$ ps -el | grep Z") of "monit_restart.sh" (14 zombies) and got rid of them by killing their parent ("monit").
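
To see which parent is holding the zombies before killing anything, a small sketch like the following could help. The function name zombie_parents is made up for illustration; it only reads ps-style "STAT PPID COMMAND" lines from stdin.

```shell
#!/bin/sh
# Hypothetical helper: count zombie processes per parent PID.
# Expects lines of the form "STAT PPID COMMAND" on stdin, e.g. from:
#   ps -e -o stat= -o ppid= -o comm=
zombie_parents() {
    # A process state starting with "Z" marks a defunct (zombie) process.
    awk '$1 ~ /^Z/ { count[$2]++ }
         END { for (p in count) print p, count[p] }' | sort -n
}
```

Each output line is a parent PID followed by its number of zombies. A zombie cannot be killed itself; it disappears once its parent reaps it or exits, which is why killing monit cleared them.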

Increased the "set daemon" value in /etc/monit/monitrc from 10 to 20, so that a "cycle" lasts 20 seconds.
Increased the startup timeout from 90 seconds to 120.
Added an alert to lobid-admin if a timeout occurs on lobid-resources, lobid-gnd, nwbib etc.

Reloaded and restarted monit.

Committed and pushed /etc/monit/conf.d/play-instances.rc.

dr0i commented 3 years ago

Changed the config to, e.g.:

  if failed host 127.0.0.1 port 8000 then restart
  if 5 restarts within 15 cycles then timeout
         alert $mailaddress only on { timeout }

This seems to prevent too many mails from being sent.

dr0i commented 3 years ago

Closing.

hbz / lobid

Play apps crashed, monit silently unmonitors #465