Monitor and report when mailservers are out of sync with each other

status-im / infra-eth-cluster

Infrastructure for Status-go fleets

https://github.com/status-im/status-go

Monitor and report when mailservers are out of sync with each other #20

Closed mandrigin closed 5 years ago

mandrigin commented 6 years ago

We need a way to check the health of our mailservers.

Adam created a script that checks the consistency of the messages stored on the mailservers: https://github.com/status-im/statusd-bots/pull/7

What do we need to do?

[ ] Run this script every 10 minutes
[ ] Report the data about mailservers to grafana
[ ] Send an email once a day to devops, igor@status.im, adam@status.im with the compiled report of these runs per day.

Later, we can set up a real-time notification system, but for now this is still better than nothing.
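As a rough illustration of the daily report from the checklist above, here is a minimal sketch that summarizes one day of check runs per fleet. It assumes a hypothetical log file with one line per run in the form `<ISO-8601 timestamp> <fleet> <OK|FAIL>`; that format, the function name `daily_report`, and the fleet name `eth.beta` are illustrative assumptions, not something the script from the linked PR is known to produce.

```shell
# Sketch only: the log format below is an assumption, not the real script output.
# Expected input on stdin, one line per run:
#   <ISO-8601 timestamp> <fleet> <OK|FAIL>
# Usage: daily_report 2018-10-01 < runs.log
daily_report() {
  awk -v day="$1" '
    index($1, day) == 1 {            # keep only runs whose timestamp starts with the given day
      total[$2]++                    # count every run for this fleet
      if ($3 == "FAIL") failed[$2]++ # count failed runs separately
    }
    END {
      for (fleet in total)
        printf "%s: %d runs, %d failed\n", fleet, total[fleet], failed[fleet] + 0
    }
  ' | sort                           # stable, alphabetical order by fleet name
}
```

The output is plain text, so it could be piped into whatever mailer is available on the host to produce the once-a-day email.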

jakubgs commented 6 years ago

Prometheus (the metrics store behind Grafana) already has real-time alerting: https://github.com/status-im/infra-hq/blob/master/ansible/roles/prometheus-master/templates/rules.yml.j2 The alerts show up here: https://alerts.status.im/ and are then sent to PagerDuty.

So if we specify a threshold for the metrics we collect, we can alert on it. How soon do we need this? I'm kind of swamped with high-priority work, like implementing search for our main website.

mandrigin commented 6 years ago

Well, this is related to the message delivery ratio in Status, so I think it is very important.

For now, I don't want a threshold, because I don't know what the values are or how widely they will vary. I'd like to keep it in a "monitor and watch" state for a week or so.

jakubgs commented 6 years ago

Okay, I'll poke @adambabik about how to deploy this properly.

adambabik commented 6 years ago

Usage:

$ ./bin/x-check-mailserver \
    -p status \
    -p other-channel \
    -m enode://7de99e4cb1b3523bd26ca212369540646607c721ad4f3e5c821ed9148150ce6ce2e72631723002210fac1fd52dfa8bbdf3555e05379af79515e1179da37cc3db@35.188.19.210:30504 \
    -m enode://...

It's not persistent, so run one instance per fleet periodically (every 10 minutes sounds fine) and pass it all the mailservers within that single fleet.

I will push a docker image as well.
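To make the per-fleet invocation above concrete, here is a minimal sketch of a helper that assembles the command line for one fleet from a list of channels and mailserver enodes. Only the `-p` and `-m` flags come from the usage shown above; the helper name `build_xcheck_cmd`, the rule that any argument starting with `enode://` is treated as a mailserver, and the placeholder enode address are assumptions for illustration.

```shell
# Sketch only: builds (does not run) the x-check-mailserver command for one fleet.
# First argument: path to the binary. Remaining arguments: channel names and
# enode:// addresses in any order. Runs in a subshell so no variables leak.
build_xcheck_cmd() (
  cmd="$1"
  shift
  for arg in "$@"; do
    case "$arg" in
      enode://*) cmd="$cmd -m $arg" ;;  # mailserver to cross-check
      *)         cmd="$cmd -p $arg" ;;  # public channel to compare
    esac
  done
  printf '%s\n' "$cmd"
)

# Example (placeholder enode, not a real mailserver):
#   build_xcheck_cmd ./bin/x-check-mailserver status enode://abc@127.0.0.1:30504
# A cron schedule of "*/10 * * * *" would run the resulting command every 10 minutes.
```

Building the command separately from running it keeps one entry per fleet easy to review, since all mailservers of a fleet have to be passed in a single invocation.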

jakubgs commented 6 years ago

Okay, we have it deployed at canary.status.im: https://canary.status.im/icingaweb2/monitoring/list/services?servicegroup_name=x-check&sort=service_severity#!/icingaweb2/monitoring/service/show?host=X-Check%20Mailserver&service=eth.test%3Astatus-core

But it has 2 issues:

Logs are too verbose even on crit settings and make the page hard to parse.
There needs to be a port setting, otherwise the different processes will clash.

jakubgs commented 6 years ago

Configured in: https://github.com/status-im/infra-hq/commit/1989df8cfa200c1dd85436ee184e04de1caab04e

adambabik commented 6 years ago

@jakubgs I have just fixed the logs. Check out the master branch: https://github.com/status-im/statusd-bots

jakubgs commented 5 years ago

Nice, thanks! Now it looks much better: https://canary.status.im/icingaweb2/monitoring/list/services?service_state=0&host=X-Check%20Mailserver

mandrigin commented 5 years ago

@jakubgs @adambabik can we close this issue then? Is there anything left to do?

adambabik commented 5 years ago

I think we can call it done. We do not push any stats to Grafana because there aren't any, but an email will be sent if any check fails.