Closed by mandrigin 5 years ago
Prometheus (the Grafana metrics store) already has real-time alerting: https://github.com/status-im/infra-hq/blob/master/ansible/roles/prometheus-master/templates/rules.yml.j2 The alerts show up here: https://alerts.status.im/ and are then sent to PagerDuty.
So if we specify some kind of threshold for the metrics we collect, we can alert on it. How soon do we need this? I'm kinda swamped with high-priority work, like implementing search for our main website.
Well, this is related to the message delivery ratio in Status, so I think it is very important.
For now, I don't want a threshold because I don't know what the values are or what the dispersion would be. I'd like to keep it in the "monitor and watch" state for a week or so.
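A minimal sketch of how the dispersion could be measured after the "monitor and watch" period, assuming the collected values can be exported as one number per line. The `summarize` helper and the input format are hypothetical and not part of any existing tooling here.

```shell
# summarize: read one numeric sample per line on stdin and print the
# sample count, mean, and population standard deviation.
# Hypothetical helper; the one-value-per-line input format is an assumption.
summarize() {
  awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }   # skip blank lines
    END {
      if (n == 0) { print "n=0"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0                    # guard against rounding error
      printf "n=%d mean=%.2f stddev=%.2f\n", n, mean, sqrt(var)
    }
  '
}
```

For example, `summarize < samples.txt` (where `samples.txt` is a hypothetical export of a week of values) prints the three numbers on one line; a threshold could then be chosen relative to the observed mean and standard deviation.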
Okay, I'll poke @adambabik about how to deploy this properly.
Usage:
$ ./bin/x-check-mailserver \
-p status \
-p other-channel \
-m enode://7de99e4cb1b3523bd26ca212369540646607c721ad4f3e5c821ed9148150ce6ce2e72631723002210fac1fd52dfa8bbdf3555e05379af79515e1179da37cc3db@35.188.19.210:30504 \
-m enode://...
It's not persistent, so run one instance per fleet periodically (every 10 minutes sounds fine) and pass it all the mail servers within that fleet.
I will push a docker image as well.
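The periodic per-fleet run described above could be wrapped roughly like this. It is a sketch, not the deployed setup: `build_args`, `run_check`, and the way channels and mail servers are passed in are hypothetical; only the `-p` and `-m` flags and the binary path come from the usage shown above.

```shell
# build_args CHANNELS MAILSERVERS
# Both arguments are space-separated lists. Prints "-p <channel>" for each
# channel followed by "-m <enode>" for each mail server, on a single line.
# Hypothetical helper; assumes the values contain no spaces or glob characters.
build_args() {
  out=""
  for c in $1; do out="$out -p $c"; done
  for m in $2; do out="$out -m $m"; done
  printf '%s\n' "${out# }"
}

# run_check CHANNELS MAILSERVERS
# Runs a single check for one fleet, passing all of its mail servers at once.
# Not called here; invoke it from a scheduler.
run_check() {
  # Word splitting of the generated argument string is intentional.
  ./bin/x-check-mailserver $(build_args "$1" "$2")
}
```

Scheduling it every 10 minutes could then be a single cron entry such as `*/10 * * * * /path/to/wrapper.sh`, where the script path is a placeholder and the script calls `run_check` with one fleet's channels and mail servers.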
Okay, we have it deployed at canary.status.im: https://canary.status.im/icingaweb2/monitoring/list/services?servicegroup_name=x-check&sort=service_severity#!/icingaweb2/monitoring/service/show?host=X-Check%20Mailserver&service=eth.test%3Astatus-core
But it has 2 issues:
crit
settings and make the page hard to parse.
@jakubgs I have just fixed the logs. Check out the master branch: https://github.com/status-im/statusd-bots
Nice, thanks! Now it looks much better: https://canary.status.im/icingaweb2/monitoring/list/services?service_state=0&host=X-Check%20Mailserver
@jakubgs @adambabik can we close this issue then? Is there anything left to do?
I think we can call it done. We do not push any stats to Grafana because there aren't any, but an email will be sent if any check fails.
We need a way to check the health of our mailservers.
Adam created a script that checks the consistency of the messages stored on the mailservers: https://github.com/status-im/statusd-bots/pull/7
What do we need to do?
Later, we can set up a real-time notification system, but this is still better than nothing.