Closed dr0i closed 3 years ago
Again discovered many defunct processes ("zombies"; check with "$ ps -el | grep Z") of "monit_restart.sh" (14 zombies) and removed them by killing the parent ("monit").
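For reference, a small sketch of how the zombies and their parent PIDs could be listed in one go. It assumes the procps "ps" found on Linux; the function name list_zombies is only illustrative and not part of any existing script.

```shell
#!/bin/sh
# Sketch: print "PID PPID COMMAND" for every zombie (state Z) process.
# The filter reads ps output on stdin, so it can also be run on saved output.
# Assumed column order: PID PPID STAT COMMAND.
list_zombies() {
    # NR > 1 skips the header line; a STAT value starting with Z marks a defunct process
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Usage:
#   ps -eo pid,ppid,stat,comm | list_zombies
```

The PPID column shows which parent has not reaped its children, which is the process that would have to be restarted or killed (here: monit).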
Changed set daemon in /etc/monit/monitrc from 10 to 20, making a "cycle" last 20 seconds. Reloaded and restarted monit.
Committed and pushed /etc/monit/conf.d/play-instances.rc.
Changed config to e.g.:
if failed host 127.0.0.1 port 8000 then restart
if 5 restarts within 15 cycles then timeout
alert $mailaddress only on { timeout }
This seems to prevent too many mails from being sent.
Closing.
Today we discovered that the Play app of lobid-resources on quaoar1 and weywot2 couldn't be launched and was silently unmonitored, resulting in downtime of lobid-resources and of dependent apps like nwbib.
One issue is that while the restart.sh script already removes the RUNNING_PID, the monit_restart.sh script does not. As monit uses monit_restart.sh, Play sometimes refuses to restart the web app because the RUNNING_PID still exists (even when the app has crashed). So the solution is to make monit_restart.sh remove the RUNNING_PID as well.
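A stale-PID cleanup for monit_restart.sh could look like the following sketch. It is not taken from the actual scripts: the function name remove_stale_pid and the path in the usage comment are placeholders, and it assumes the script runs as the same user as the Play app (or as root), since "kill -0" also fails when the caller lacks permission to signal the process.

```shell
#!/bin/sh
# Sketch: delete a Play RUNNING_PID file only if the process it names is gone.
# Returns 0 if the file is absent or was removed, 1 if the process is still alive.
remove_stale_pid() {
    pidfile=$1
    # Nothing to clean up if there is no PID file
    [ -f "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only checks whether the process can be signalled
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process $pid still running, keeping $pidfile" >&2
        return 1
    fi
    rm -f "$pidfile"
}

# Usage (placeholder path), before starting the app again:
#   remove_stale_pid /path/to/app/RUNNING_PID
```

Checking the PID first avoids deleting the file under a running instance, which would otherwise allow a second instance to be started against the same port.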
It would also be nice to
(We don't need to be informed when monit restarts a process, because that happens once a month (via crontab) for almost all web apps and is not a problem in itself: the high-availability proxy of Apache redirects to the spare server. We don't want too many emails because that would be too noisy.)