patchew-project / patchew

A patch email tracking and testing system
MIT License

Health check infrastructure #58

Open famz opened 6 years ago

famz commented 6 years ago

Multiple nodes, and many services on them, work together. Consequently, when one component breaks, it may take a few days before anyone notices.

Add a health check mechanism so we can monitor the status of the webserver, the importer (applier), and all testers, and perhaps even the IMAP accounts we use, in a flexible way.

A previous attempt used Ansible Tower, together with auto-deploy. It had two main problems:

  1. It's not open source.
  2. It doesn't have a Git/GitHub trigger for auto-deployment. Only a recurring trigger is supported.

Ideas for using other DevOps software are also welcome!

Alternatively, we can build this functionality into the server code (mods/healthcheck.py, for example), and rely on https://uptimerobot.com/ to monitor its availability.
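
As a rough sketch of the built-in option, assume each component (webserver, importer, testers) periodically reports a heartbeat to the server, and the server flags any component it has not heard from recently. The class name `HeartbeatRegistry`, the `max_age_seconds` threshold, and the component names are all hypothetical; nothing like this exists in Patchew yet.

```python
import time


class HeartbeatRegistry:
    """Tracks when each component last reported in (hypothetical helper)."""

    def __init__(self, max_age_seconds, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # component name -> timestamp of last beat

    def beat(self, component):
        """Record that `component` is alive right now."""
        self.last_seen[component] = self.clock()

    def status(self, expected):
        """Return a per-component report: 'ok', 'stale' or 'never seen'."""
        now = self.clock()
        report = {}
        for name in expected:
            seen = self.last_seen.get(name)
            if seen is None:
                report[name] = "never seen"
            elif now - seen > self.max_age:
                report[name] = "stale"
            else:
                report[name] = "ok"
        return report

    def healthy(self, expected):
        """True only if every expected component reported recently."""
        return all(state == "ok" for state in self.status(expected).values())
```

A health page backed by `healthy()` could then be polled by an external monitor, so that a silent importer or tester shows up as an outage rather than going unnoticed.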

famz commented 6 years ago

@bonzini Any input? :)

bonzini commented 6 years ago

I'm not really an expert and haven't thought much about it, but I would rather not reinvent the wheel. There must be some open source components that do this... What do you mean by auto-deploy?

famz commented 6 years ago

Auto-deploy is a different story: I want some way to deploy new code to next.patchew.org whenever we push to GitHub master.

I agree we had better use a library. I know little about this area too, and it needs some research. However, adding a mods/healthcheck.py is not necessarily reinventing the wheel: even integrating with something like Prometheus may require this work.

What I really mean by "alternative" is that we can instead keep this as simple as a few lines of code:
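
A minimal sketch of what such a few-line check could look like (this is an illustration, not the code from the original comment). It assumes a plain WSGI endpoint that runs a set of named checks and answers 200 when all pass and 503 otherwise, which is the kind of response an uptime monitor can watch. The names `run_checks` and `make_health_app`, and the check names in the usage note, are hypothetical.

```python
import json


def run_checks(checks):
    """Run each check; it passes if it returns a truthy value without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def make_health_app(checks):
    """Build a WSGI app that reports 200 if every check passes, else 503."""
    def app(environ, start_response):
        results = run_checks(checks)
        ok = all(results.values())
        status = "200 OK" if ok else "503 Service Unavailable"
        body = json.dumps(results, sort_keys=True).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

For example, `make_health_app({"database": check_db, "imap": check_imap})` would return an app that could be mounted at a URL such as `/health`, where `check_db` and `check_imap` stand for whatever probes the deployment needs.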

bonzini commented 6 years ago

I think Prometheus or Uptime Robot would be integrated at the Django level? Something like https://github.com/korfuri/django-prometheus or https://github.com/uncommitted-and-forgotten/django-uptimerobot. You might need a plugin to add patchew-specific counters, but in general DB and web server access should be handled outside Patchew.
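
To illustrate what "patchew-specific counters" might amount to, here is a small standard-library sketch that renders counters in the Prometheus text exposition format. In practice a client library would normally do this; the class `SimpleCounter` and the metric names are hypothetical and only show the shape of the data a plugin would expose.

```python
class SimpleCounter:
    """A monotonically increasing counter (hypothetical stand-in for a client library)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.value = 0

    def inc(self, amount=1):
        """Increase the counter; counters must never go down."""
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

    def render(self):
        """Render this counter as HELP, TYPE and sample lines."""
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


def render_metrics(counters):
    """Concatenate all counters into one scrape response body."""
    return "".join(counter.render() for counter in counters)


# Hypothetical Patchew metrics, for illustration only.
series_imported = SimpleCounter("patchew_series_imported_total", "Series imported.")
tests_failed = SimpleCounter("patchew_tests_failed_total", "Test runs that failed.")
series_imported.inc(3)
tests_failed.inc()
print(render_metrics([series_imported, tests_failed]))
```

A scraper would fetch this text from a metrics URL on the server, while generic database and web server metrics would still come from existing integrations rather than from Patchew itself.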

Would you run the Prometheus (or Uptime Robot) server on the tester and importer too? Perhaps the tester should be moved out of patchew-cli and into a separate script (and the REST API issue also talks about Patchew "pushing" series to testers via webhooks, instead of the other way round).