Tech Leads can't tell when Sentinel is down

medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.

https://communityhealthtoolkit.org

GNU Affero General Public License v3.0

Tech Leads can't tell when Sentinel is down #2111

Closed estellecomment closed 8 years ago

estellecomment commented 8 years ago

Sentinel goes down sometimes, but you can't necessarily tell, you can just see that the app is behaving weird and try to see if rebooting Sentinel helps. When you're new to the job and you don't know this, it can waste a loooot of your time.

SCdF commented 8 years ago

I wonder how easy it would be for Sentinel (or a cron on the same box) to detect that it's "down" (this is obv. highly dependent on what "down" ends up being caused by) and reboot itself. This is a dirty hack, but it might be a good stop-gap between now and making Sentinel more reliable.

estellecomment commented 8 years ago

Priority: 5/5. @bishwas-medic has an instance where messages were stuck for 1.5 months because of this. The partner took a long time to report it, probably because CHWs took a long time to report it. It's probably the second biggest source of reported issues after SMSSync (~15%).

Auto-reboot is a quick fix, plus reporting when it goes down, so that we have stats about this.

mandric commented 8 years ago

My current way to deal with sentinel problems is to mark records with errors when sentinel fails at doing something. If that can't be done, then there should be decent clues in the logs. If sentinel is failing (typically in transitions) and not leaving an error trail on the records it's failing to process (or the trail is not visible in the UI), then that is a sentinel or potentially webapp bug. If sentinel is failing harder than that and not leaving a trail in the logs, then that is a sentinel bug.

estellecomment commented 8 years ago

Sentinel is restarted by gardener when it crashes, so I wrote a script that looks for the restart message and emails alerts if there have been too many restarts lately.

Next : cron job to run it periodically. Also next : how to monitor the sentinel monitor!!
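For readers who want to picture the restart-counting idea, here is a minimal sketch in Python. It is not the actual script from this issue: the log line format, the "restarting sentinel" message text, and the names `recent_restarts` and `should_alert` are all assumptions made for illustration, and sending the email is left out.

```python
import re
from datetime import datetime, timedelta

# ASSUMPTION: log lines look like "2016-03-01T12:00:00 gardener: restarting sentinel".
# The real gardener message and timestamp format may differ.
RESTART_PATTERN = re.compile(r"^(\S+) .*restarting sentinel", re.IGNORECASE)


def recent_restarts(log_lines, now, window=timedelta(hours=1)):
    """Return timestamps of restart messages inside the window ending at `now`."""
    cutoff = now - window
    restarts = []
    for line in log_lines:
        match = RESTART_PATTERN.match(line)
        if not match:
            continue
        try:
            stamp = datetime.fromisoformat(match.group(1))
        except ValueError:
            continue  # first field was not a parseable timestamp
        if cutoff <= stamp <= now:
            restarts.append(stamp)
    return restarts


def should_alert(log_lines, now, max_restarts=3, window=timedelta(hours=1)):
    """True when the window holds more restarts than max_restarts (hypothetical threshold)."""
    return len(recent_restarts(log_lines, now, window)) > max_restarts
```

A periodic job could read the log, call `should_alert(lines, datetime.now())`, and send the email only when it returns True. The threshold and window are placeholders, not recommended values.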

estellecomment commented 8 years ago

All right! All done, nice logging, nice start and kill script, much beautifulness.

Assigning to @mandric for review, and @bishwas-medic I'd love to have your input as well, since you've gone through sentinel-trouble.

is it understandable?
does it work?
does it need anything else?
what values would make sense for the number of events that trigger an email?
what recipient email?
Are you offended by being called "chickens"? I would have gone with baby goats, since they're cross-culturally considered cute, but there's emoji for adult goats only, which are less cute.

mandric commented 8 years ago

Sorry for the delay here. PR + squash in the future please! Also:

Every good commit should be able to complete the following sentence: When applied, this commit will: {{YOUR COMMIT MESSAGE}}

See for more tips: http://www.alexkras.com/19-git-tips-for-everyday-use/#good-commit-message

mandric commented 8 years ago

I would probably create this as a separate project like medic-sentinel-monitor or even medic-monitors if we are going to need to monitor more than sentinel.

mandric commented 8 years ago

This really has little to do with webapp and probably should have its own home.

mandric commented 8 years ago

Would you move it for me and then I will review again?

estellecomment commented 8 years ago

Here goes! https://github.com/medic/medic-monitoring/tree/master/sentinel_monitor

mandric commented 8 years ago

Awesome! Ok now how do you feel about removing the nodejs based cron (sentinel_monitor_launcher.js) and using a simple unix cron entry instead? Just include some instructions on setting up an example cronjob in the readme? Following the "less code is better" principle here...

estellecomment commented 8 years ago

Done @mandric !

estellecomment commented 8 years ago

Figuring out with @browndav where the best place to put it is. Right now it's in the vm user's home, which is probably not the best.

estellecomment commented 8 years ago

Messed up the issue linking again, that hashtagging is too complex for me. Commit here: https://github.com/medic/medic-monitoring/commit/af1572a49da375bf2fb527f67c87e3c4af3578b5

estellecomment commented 8 years ago

Sentinel Monitor has been running on srhgorkha for a week and seems to do well, though not much happened so it's hard to tell. Closing, and I'll reopen if bugs show up.

mandric commented 8 years ago

Great!

ngaruko commented 7 years ago

@bishwas-medic Just checking, is this working on your side? Are you able to Acceptance Test this?

bishwasBhatta commented 7 years ago

@ngaruko I installed Estelle's sentinel monitoring script (https://github.com/medic/medic-monitoring) on one of our live project instances and it seems to have been working without issues so far. I haven't received any email alerts for sentinel going down till now.

There was talk of integrating the monitoring script into our standard webapp package so that it is activated automatically whenever a new webapp instance is launched. I have no idea about the progress on that front though.

ghost commented 7 years ago

I took this as far as establishing and documenting https://github.com/medic/medic-os-simplest-possible-package; in theory it should be as simple as dropping the script in there, and probably making a minor tweak to the env to figure out where node is (both to be safer than depending on a pre-set $PATH, and due to https://github.com/medic/medic-webapp/issues/2750 being in the current iteration and not shipped yet).
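One way to avoid depending on a pre-set $PATH, sketched in Python under stated assumptions: the `NODE_BIN` variable name, the `find_node` helper, and any candidate paths passed to it are hypothetical and not part of the package linked above.

```python
import os
import shutil


def find_node(candidates, env_var="NODE_BIN"):
    """Return a usable node path: env override first, then explicit candidates, then $PATH.

    ASSUMPTION: `env_var` and the candidate paths are illustrative only.
    """
    override = os.environ.get(env_var)
    paths = ([override] if override else []) + list(candidates)
    for path in paths:
        # Accept only regular files that the current user may execute
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return shutil.which("node")  # last resort; None when node is not on $PATH
```

The caller would pass whatever install locations apply to its own image; nothing here knows where node actually lives.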

ghost commented 7 years ago

I know there were also a few challenges around sending e-mail from AWS that @mandric had briefly looked into. I'm wondering about using the script in a slightly different way: have an external monitoring service (like the one we're using for HTTP status) ask the service supervisor what the status is, essentially pull vs. push. That could be easier than a long-running service, and being able to remotely query the status of individual services could be useful in general (for us, only).
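To make the pull idea concrete, here is a minimal sketch in Python of a status endpoint an external HTTP monitor could poll. Everything in it is an assumption for illustration: the `get_statuses` callable stands in for however the service supervisor would actually be queried, and the "up"/"down" strings and function names are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def status_response(statuses):
    """Map {"service": "up"|"down"} to (http_code, json_body_bytes).

    503 on any non-"up" service lets an HTTP-status monitor raise the alarm
    without parsing the body.
    """
    all_up = all(state == "up" for state in statuses.values())
    body = json.dumps(statuses, sort_keys=True).encode("utf-8")
    return (200 if all_up else 503), body


def make_status_handler(get_statuses):
    """Build a handler class that reports whatever get_statuses() returns.

    ASSUMPTION: get_statuses is a placeholder for asking the supervisor.
    """

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            code, body = status_response(get_statuses())
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return StatusHandler


def serve_status(get_statuses, host="127.0.0.1", port=0):
    """Create (but do not start) the server; port 0 asks the OS for a free port."""
    return HTTPServer((host, port), make_status_handler(get_statuses))
```

With this shape no e-mail has to leave the box: the monitor that already checks HTTP status would poll this endpoint too. Access control is deliberately left out of the sketch and would matter if the status is meant for internal use only.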

ngaruko commented 7 years ago

Hi @browndav. I am marking this as Ready but would suggest you file an issue with the above suggestions for improvement.