Open kelson42 opened 5 years ago
Something like https://icinga.com/
I've added uptimerobot for the new wp1.openzim.org URL.
We could run a cron job on the workers image that checks how many items are in the FAILED queue and, if the count goes over N, sends us an email?
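A minimal sketch of what that cron check could look like, just to make the idea concrete. The names `build_alert` and `check_failed_queue`, the email addresses, and the way the count is obtained are all assumptions, not existing wp1 code: the count and the mail sending are passed in as callables so the threshold logic stays independent of the queue library and mail setup.

```python
from email.message import EmailMessage
from typing import Callable, Optional


def build_alert(failed_count: int, threshold: int) -> Optional[EmailMessage]:
    """Return an alert email if failed_count is over threshold, else None."""
    if failed_count <= threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"wp1: {failed_count} jobs in FAILED queue (threshold {threshold})"
    msg["From"] = "wp1-monitor@example.org"  # hypothetical sender address
    msg["To"] = "admins@example.org"  # hypothetical recipient address
    msg.set_content(
        f"The FAILED queue holds {failed_count} items, which is over the "
        f"configured limit of {threshold}."
    )
    return msg


def check_failed_queue(
    count_failed: Callable[[], int],
    threshold: int,
    send: Callable[[EmailMessage], None],
) -> bool:
    """Run one check; return True if an alert was sent.

    count_failed: hypothetical callable returning the current FAILED queue size.
    send: hypothetical callable that delivers the message (e.g. via SMTP).
    """
    msg = build_alert(count_failed(), threshold)
    if msg is None:
        return False
    send(msg)
    return True
```

The cron entry would then call `check_failed_queue` with a real counter and a real sender; how often it runs and what N should be are still open questions.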
@audiodude Thank you for commenting. I think there are many things to add to this ticket on my side.
First, we have a Kiwix uptimerobot.com account (that was good advice from you) with an entry for wp1.openzim.org. I believe it is important that monitoring is centralised for our services. If that is OK with you, I would like to add you as a recipient.
Then, I think the problem in this ticket is complex enough to need a multi-part technical answer:
The least debatable part is monitoring that the solution is basically running fine right now. Just monitoring for HTTP 200 on https://wp1.openzim.org falls short, and I believe this is something we can all agree on. I believe the subsystems should be tested properly and we should follow the same path here as for the Kiwix Hotspot Cardshop, see https://github.com/kiwix/cardshop/issues/114 (https://cardshop.hotspot.kiwix.org/health-check).
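To illustrate the subsystem idea, here is a rough sketch of the aggregation logic a health-check endpoint could use. It is not how the Cardshop endpoint is implemented (I have not copied anything from it), and `run_health_checks` and the check names in the usage note are hypothetical. Each subsystem check is a callable that raises on failure; the overall status is 200 only if every check passes, otherwise 503.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(
    checks: Dict[str, Callable[[], None]],
) -> Tuple[int, Dict[str, str]]:
    """Run every subsystem check and return (http_status, per-check report).

    A check signals failure by raising any exception.
    """
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:  # a failing check must not abort the others
            report[name] = f"failed: {exc}"
    status = 200 if all(value == "ok" for value in report.values()) else 503
    return status, report
```

In practice the dictionary would hold things like a database ping or a queue ping (for example `{"database": ping_db, "queue": ping_queue}`, both names made up here), and uptimerobot would watch the endpoint returning this status instead of the home page.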
Then we have the problem of an application that works overall but has buggy edge cases, typically a Web page/API called with certain parameters that makes the request crash (but not the whole daemon). For the moment, and AFAIK, this will go unnoticed as long as no user complains. IMO this is not super critical, but we should be able to detect it before someone complains. If the application errors are caught/handled properly, this could probably be solved via the audience measurement tool, see https://github.com/openzim/wp1/issues/248. Typically we could have a look at the HTTP 5XX errors.
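As a rough illustration of "having a look at the HTTP 5XX errors", the sketch below counts 5XX responses in web server access log lines. It assumes the logs use the Common Log Format (status code right after the quoted request line), which I have not verified for wp1.openzim.org; `count_5xx` is a hypothetical helper, and an audience measurement tool would do the equivalent for us.

```python
import re
from collections import Counter
from typing import Iterable

# In Common Log Format the status code follows the quoted request line:
# ... "GET /path HTTP/1.1" 500 1234
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_5xx(lines: Iterable[str]) -> Counter:
    """Count occurrences of each 5XX status code in access log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group(1).startswith("5"):
            counts[match.group(1)] += 1
    return counts
```

Lines that do not match the assumed format are skipped silently, so a zero count should be read with that in mind.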
Finally, we have the case of analysing the problems (a call stack trace, for example). This can really only be done with a tool linked to the application logs. I'm not sure we really need one (this could be done manually). But if you want to do it, and after thinking twice about it, I'm a bit reluctant to build a solution dedicated to wp1.openzim.org; we should rather go in the direction of a cloud-based solution we could use for all our applications. That approach would be more interesting, I believe.
Okay at the very least I'll delete my uptimerobot entry and you can add me to yours.
We already have minimal monitoring (using uptimerobot) based on the simple availability of a web page. But using this we cannot detect:
For this we need a solution able to monitor the application logs.