Closed: estellecomment closed this issue 8 years ago
I wonder how easy it would be for Sentinel (or a cron on the same box) to detect that it's "down" (this is obv. highly dependent on what "down" ends up being caused by) and reboot itself. This is a dirty hack, but it might be a good stop-gap between now and making Sentinel more reliable.
Priority: 5/5. @bishwas-medic has an instance where the messages were stuck for 1.5 months because of this. The partner took a long time to report it, probably because CHWs took a long time to report it. It's probably the second biggest source of reported issues after SMSSync (~15%).
Auto-reboot is a quick fix, plus reporting when it goes down, so that we have stats about this.
My current way to deal with sentinel problems is to mark records with errors when sentinel fails doing something. If that can't be done then there should be decent clues in the logs. If sentinel is failing (typically transitions) and not leaving an error trail on the records it's failing to process (or it's not visible in the UI) then that is a sentinel or potentially webapp bug. If sentinel is failing harder than that and not leaving a trail in the logs then that is a sentinel bug.
Sentinel is restarted by gardener when it crashes, so I wrote a script that looks for the restart message, and emails alerts if there are too many restarts lately.
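For anyone reading along, the "count recent restarts in the log" idea could look roughly like the sketch below. This is not the actual script: the restart marker text, the ISO timestamp at the start of each log line, the one-hour window and the threshold of 3 are all assumptions made for illustration, and the emailing step is left out.

```javascript
// Hypothetical values; the real restart message and thresholds may differ.
const RESTART_MARKER = 'restarting sentinel';
const WINDOW_MS = 60 * 60 * 1000; // look back one hour
const MAX_RESTARTS = 3;

// Assumes each log line starts with an ISO 8601 timestamp followed by a space.
// Returns milliseconds since the epoch, or null if the line has no valid timestamp.
function parseTimestamp(line) {
  const ms = Date.parse(line.split(' ', 1)[0]);
  return Number.isNaN(ms) ? null : ms;
}

// Counts restart lines whose timestamp falls inside the window ending at `now`.
function countRecentRestarts(logText, now, windowMs = WINDOW_MS) {
  return logText.split('\n').filter((line) => {
    if (!line.toLowerCase().includes(RESTART_MARKER)) return false;
    const ts = parseTimestamp(line);
    return ts !== null && ts <= now && now - ts <= windowMs;
  }).length;
}

// True when there were more restarts in the window than we are willing to tolerate.
function shouldAlert(logText, now, maxRestarts = MAX_RESTARTS) {
  return countRecentRestarts(logText, now) > maxRestarts;
}
```

A caller would read the log file, pass its contents and `Date.now()` to `shouldAlert`, and send the alert email only when it returns true.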
Next : cron job to run it periodically. Also next : how to monitor the sentinel monitor!!
All right! All done, nice logging, nice start and kill script, much beautifulness.
Assigning to @mandric for review, and @bishwas-medic I'd love to have your input as well, since you've gone through sentinel-trouble.
Sorry for the delay here. PR + squash in the future please! Also:
Every good commit should be able to complete the following sentence: When applied, this commit will: {{YOUR COMMIT MESSAGE}}
See for more tips: http://www.alexkras.com/19-git-tips-for-everyday-use/#good-commit-message
I would probably create this as a separate project like medic-sentinel-monitor or even medic-monitors if we are going to need to monitor more than sentinel.
This really has little to do with webapp and probably should have its own home.
Would you move it for me and then I will review again?
Awesome! Ok now how do you feel about removing the nodejs based cron (sentinel_monitor_launcher.js) and using a simple unix cron entry instead? Just include some instructions on setting up an example cronjob in the readme? Following the "less code is better" principle here...
Done @mandric !
Figuring out with @browndav where the best place to put it is. Right now it's in vm user's home, which is probably not the best.
Messed up the issue linking again, that hashtagging is too complex for me. Commit here: https://github.com/medic/medic-monitoring/commit/af1572a49da375bf2fb527f67c87e3c4af3578b5
Sentinel Monitor has been running on srhgorkha for a week and seems to do well, though not much happened so it's hard to tell. Closing, and I'll reopen if bugs show up.
Great!
@bishwas-medic Just checking, is this working on your side? Are you able to Acceptance Test this?
@ngaruko I installed Estelle's sentinel monitoring script (https://github.com/medic/medic-monitoring) on one of our live project instances and it seems to have been working without issues so far. I haven't received any email alerts for sentinel going down till now.
There were talks of integrating the monitoring script into our standard webapp package so that it is activated automatically whenever a new webapp instance is launched. I have no idea about the progress on that front though.
I took this as far as establishing and documenting https://github.com/medic/medic-os-simplest-possible-package; in theory it should be as simple as dropping the script in there, and probably making a minor tweak to the env to figure where node is (both to be safer than depending on a pre-set $PATH, and due to https://github.com/medic/medic-webapp/issues/2750 being in the current iteration and not shipped yet).
I know there were also a few challenges around sending e-mail from AWS that @mandric had briefly looked into. I'm wondering about finding a way to use the script slightly differently, i.e. to ask the service supervisor what the status is (e.g. from an external monitoring service like we're using for HTTP status), essentially pull vs. push. That could be easier than a long-running service, and being able to remotely query the status of individual services could be useful in general (for us, only).
Hi @browndav . I am marking this as Ready but would suggest you file an issue with the above suggestions for improvement.
Sentinel goes down sometimes, but you can't necessarily tell, you can just see that the app is behaving weird and try to see if rebooting Sentinel helps. When you're new to the job and you don't know this, it can waste a loooot of your time.