j2kun / riemann-divisor-sum

Code for the series "Searching for Riemann Hypothesis Counterexamples"
https://jeremykun.com/2020/09/11/searching-for-rh-counterexamples-setting-up-pytest/

Add alerting when processor jobs fail #20

Closed j2kun closed 3 years ago

j2kun commented 3 years ago

I'm not quite sure how I want to do this yet, but creating this issue as a placeholder.

j2kun commented 3 years ago

Maybe try alerta?

https://docs.alerta.io/en/latest/quick-start.html
https://github.com/alerta/docker-alerta

Seems free, could deploy as an HTTP server on the generator server, since that one generally doesn't do much. Has a docker container ready.

j2kun commented 3 years ago

There's one called Riemann. So tempting

http://riemann.io/

j2kun commented 3 years ago

I think I'll try something simpler first: just write a script that runs docker ps -a, parses the output, and sends an email if anything died.
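A minimal sketch of what such a script's parsing step might look like. It is not the code that ended up in the repository. It assumes `docker ps -a` is called with the `--format` flag so each container prints as a tab-separated name and status, and it assumes a container counts as "died" when its status starts with `Exited` or `Dead`. The function names are hypothetical.

```python
import subprocess

# Status prefixes treated as "died". This choice is an assumption,
# not something specified in the issue.
DEAD_PREFIXES = ("Exited", "Dead")


def docker_ps_output():
    """Return one 'name<TAB>status' line per container, running or not."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_dead_containers(ps_output):
    """Return (name, status) pairs for containers whose status looks dead."""
    dead = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, status = line.partition("\t")
        if status.startswith(DEAD_PREFIXES):
            dead.append((name, status))
    return dead
```

Calling `find_dead_containers(docker_ps_output())` on a host with Docker installed would give the list of containers to report. Keeping the parsing separate from the subprocess call makes it testable without Docker.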

j2kun commented 3 years ago

- Create an app password through Google Account (for Gmail).
- Install ssmtp.
- Configure it (/etc/ssmtp/ssmtp.conf).
- Use environment variables to hide secrets.
- Write a Python script to run the monitor, and use nohup to detach it from the terminal.
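The email step could be sketched roughly as follows, assuming ssmtp is already installed and configured. The recipient address is read from an environment variable named `ALERT_EMAIL`. That variable name, the subject line, and both function names are hypothetical. The message is piped to `ssmtp` on stdin as headers, a blank line, then the body.

```python
import os
import subprocess


def build_message(recipient, dead):
    """Format a plain-text email (headers, blank line, body) for ssmtp's stdin.

    `dead` is a list of (container_name, status) pairs.
    """
    lines = [
        f"To: {recipient}",
        "Subject: docker containers died",  # hypothetical subject line
        "",
    ]
    lines.extend(f"{name}: {status}" for name, status in dead)
    return "\n".join(lines) + "\n"


def send_alert(dead):
    """Send the alert through the locally configured ssmtp binary."""
    # ALERT_EMAIL is a hypothetical variable name; it keeps the
    # address out of the script itself.
    recipient = os.environ["ALERT_EMAIL"]
    subprocess.run(
        ["ssmtp", recipient],
        input=build_message(recipient, dead),
        text=True,
        check=True,
    )
```

If this lived in a file called, say, monitor.py and was wrapped in a polling loop, it could be started with something like `nohup python monitor.py &` so it keeps running after the terminal closes.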

j2kun commented 3 years ago

Set up the monitoring on each of the four EC2 instances. The processor nodes had run out of RAM, so I restarted them and expect them to fail again soon, which I can then use to test the alerting system.

j2kun commented 3 years ago

Looks like the processor jobs are still going after restart, so something unexpected caused them to fail... not sure, but will close this for now and see if the alerting works later