hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Play apps crashed, monit silently unmonitors #465

Closed dr0i closed 3 years ago

dr0i commented 3 years ago

Today we discovered that the Play app of lobid-resources on quaoar1 and weywot2 could not be launched and had been silently unmonitored by monit, resulting in downtime for lobid-resources and dependent apps like nwbib.

One issue is that the restart.sh script already removes the RUNNING_PID, but the monit_restart.sh script does not. Because monit uses monit_restart.sh, Play sometimes refuses to restart the web app: the RUNNING_PID still exists, even when the app has crashed. So the solution is to remove the RUNNING_PID in monit_restart.sh as well.

It would also be nice to

(We don't need to be informed when monit restarts a process: that happens once a month (via crontab) for almost all web apps and is not a problem in itself, because the high-availability proxy of Apache redirects to the spare server. We don't want too many emails because that would be too noisy.)
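The stale RUNNING_PID problem described above could be handled roughly as follows. This is only a sketch, not the actual content of restart.sh or monit_restart.sh: `remove_stale_pid` is a hypothetical helper, and the path to the RUNNING_PID file is assumed to be passed in by the caller. It removes the file only when no process with that PID is alive, so a healthy app is left alone.

```shell
#!/bin/sh
# Sketch (assumption): remove a stale Play RUNNING_PID file before a restart.
# remove_stale_pid is a hypothetical helper, not part of the real scripts.
remove_stale_pid() {
    pid_file="$1"

    # Nothing to do if there is no PID file.
    [ -f "$pid_file" ] || return 0

    pid=$(cat "$pid_file")

    # kill -0 sends no signal; it only checks whether the process can be
    # signalled. It also fails for processes owned by another user, so run
    # this as the user that owns the app.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is still running, keeping $pid_file"
        return 1
    fi

    echo "removing stale $pid_file"
    rm -f "$pid_file"
}
```

A restart script could call it right before starting the app, e.g. `remove_stale_pid /path/to/app/RUNNING_PID` (the path here is a placeholder).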

dr0i commented 3 years ago

Again discovered many defunct processes ("zombies", check with "$ ps -el | grep Z") of "monit_restart.sh" (14 zombies) and got rid of them by killing the parent ("monit").
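To see which parent has to be killed, the zombie check can be narrowed down to PID, parent PID and command name. A minimal sketch, assuming a procps-style `ps` that supports the `stat`, `pid`, `ppid` and `comm` output columns; `filter_zombies` and `list_zombies` are hypothetical helper names:

```shell
#!/bin/sh
# Sketch (assumption): list defunct processes together with their parent PID.

# Reads lines of the form "STAT PID PPID COMMAND" on stdin and prints
# "PID PPID COMMAND" for every process whose state starts with Z (zombie).
filter_zombies() {
    awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}

# The trailing "=" after each column name suppresses the header line.
list_zombies() {
    ps -eo stat=,pid=,ppid=,comm= | filter_zombies
}
```

Running `list_zombies` prints one line per zombie; the second column is the parent that has not reaped it yet.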

Reloaded monit and restarted monit.

Committed and pushed /etc/monit/conf.d/play-instances.rc.

dr0i commented 3 years ago

Changed the config to, e.g.:

  if failed host 127.0.0.1 port 8000 then restart
  if 5 restarts within 15 cycles then timeout
  alert $mailaddress only on { timeout }

This seems to prevent too many emails from being sent.

dr0i commented 3 years ago

Closing.