Monitoring of infrastructure with alerts

hellais commented 7 years ago

We currently have some basic monitoring setup here: https://munin.ooni.io/.

However, it's currently not sending us alerts when something bad happens. Having the monitoring at all is an improvement, I guess, but it's still not perfect.

What is needed in order to have us receive alerts when something unexpected happens on the infrastructure? I believe the issue lies in the fact that our @oo accounts don't receive the emails unless they are sent from a legit SMTP server.

What about setting up a Gmail account for notifications and using that?

Alternatively we could go for https://github.com/TheTorProject/ooni-sysadmin/issues/53, but I suspect that is going to be more laborious. Also, what happens if the email server goes down? Who is going to alert us of that?

SuperQ commented 7 years ago

I could help with setting up Prometheus alerting.

hellais commented 7 years ago

> I could help with setting up Prometheus alerting.

We currently have a Munin-based monitoring system. How much more complex is it to use something like Prometheus?

SuperQ commented 7 years ago

Prometheus is a little bit of up-front work, but we have Ansible scripts to deal with most of it.

But good alerting is the primary goal of Prometheus. So the long-term gains are worth it.

darkk commented 7 years ago

@SuperQ have you used blackbox_exporter to do active probing as well? Can you mention any pitfalls with Prometheus?

SuperQ commented 7 years ago

I can talk about Prometheus at length.

Yes, I have done a number of active blackbox probe setups.

I think the biggest pitfall of Prometheus is that it is a general-use monitoring system with a high degree of flexibility. That flexibility comes with a bit of a learning curve.

The only other thing that requires a bit of planning, depending on the network design, is that Prometheus collects data over a number of different HTTP ports. Prometheus targets are simple HTTP GET endpoints, with no writeable API. Because of this simple design, security was left "up to the user". The target design is for use on private networks, not on servers hosted on public IPs.
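To make "simple HTTP GET endpoints, with no writeable API" concrete, here is a minimal sketch of such a target using only the Python standard library. It is not the code of any real exporter: the metric names, the sample values, and port 9100 are assumptions chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {metric_name: value} as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical sample values; a real exporter would read them from the host.
    samples = {"demo_disk_free_bytes": 1024, "demo_up": 1}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.samples).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    # No do_POST/do_PUT handlers are defined, so the endpoint is read-only.


def serve(host="127.0.0.1", port=9100):
    """Serve /metrics until interrupted; binds to loopback by default."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling `serve()` and fetching `http://127.0.0.1:9100/metrics` returns the plain-text samples. Anyone who can reach the port can read them, which is why the network placement matters.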

hellais commented 7 years ago

Some notes related to issues with sending email notifications:

- Having a relay for @infra.ooni.io is okay-ish; having one for @openobservatory.org is probably not a good idea right now.
- The connection to the SMTP server for sending emails can be unreliable (for example, if the SMTP server is in AMS and airflow is sending emails from HK), so it may be better to have a local email queue (or one in the same location) that handles delivery.
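The local-queue idea can be sketched roughly as follows. In practice a local MTA would usually play this role; the class below is only an illustration, and `LocalMailQueue`, `smtp_sender`, and the in-memory storage are assumptions, not something from our setup.

```python
import smtplib
from collections import deque
from email.message import EmailMessage


class LocalMailQueue:
    """Hold messages locally and retry delivery until the relay accepts them."""

    def __init__(self, send):
        # send: callable(EmailMessage) that raises OSError when delivery fails
        self._send = send
        self._pending = deque()

    def enqueue(self, sender, recipient, subject, body):
        msg = EmailMessage()
        msg["From"], msg["To"], msg["Subject"] = sender, recipient, subject
        msg.set_content(body)
        self._pending.append(msg)

    def flush(self):
        """Attempt each pending message once; return how many were delivered."""
        delivered = 0
        for _ in range(len(self._pending)):
            msg = self._pending.popleft()
            try:
                self._send(msg)
                delivered += 1
            except OSError:
                # smtplib.SMTPException is an OSError subclass, so SMTP and
                # network failures both land here; keep the message for later.
                self._pending.append(msg)
        return delivered

    def __len__(self):
        return len(self._pending)


def smtp_sender(host, port=25):
    """Build a send callable for a (hypothetical) remote relay."""
    def send(msg):
        with smtplib.SMTP(host, port, timeout=10) as smtp:
            smtp.send_message(msg)
    return send
```

The sender only has to reach the queue, which is local and fast. A periodic `flush()` absorbs the flaky long-distance hop to the relay.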

hellais commented 7 years ago

@SuperQ what is the recommended way of implementing authentication and encryption for scraping targets? https://prometheus.io/docs/introduction/faq/#why-don-t-the-prometheus-server-components-support-tls-or-authentication-can-i-add-those links to a post about putting an nginx proxy with HTTP basic auth in front. Is that how you would do it?

SuperQ commented 7 years ago

That's one way to do it. But in reality, it's easier to just firewall the metrics ports off from the internet. For the most part there is no need to authenticate or encrypt the metrics traffic. The metrics endpoints are extremely simple, read-only, and very lightweight.

darkk commented 7 years ago

My opinion is that our network is not 100% trusted: the subnet is shared with other projects, and docker may unexpectedly mess with iptables rules. So I think that both a firewall and some frontend with a PSK are useful if we want to hide the surface from a possible attacker.
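As a rough sketch of what the PSK check in such a frontend could look like (the `Authorization: Bearer` header convention and the function name are assumptions for illustration, not an agreed design):

```python
import hmac


def is_authorized(headers, psk):
    """Return True if the request headers carry the expected pre-shared key."""
    presented = headers.get("Authorization", "")
    expected = f"Bearer {psk}"
    # compare_digest avoids leaking how many leading bytes matched via timing.
    return hmac.compare_digest(presented.encode(), expected.encode())
```

A frontend would call this before proxying the request to the exporter and answer 401 otherwise. The firewall still does the first round of filtering.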

hellais commented 7 years ago

Yes, I agree with @darkk that we probably want some extra protection layer, as the local network is not really trusted. We also have machines that we care to monitor residing in different datacenters, and that traffic would be traversing the public internet.

@SuperQ would you be up for helping us set up a Prometheus instance?

SuperQ commented 7 years ago

For the case of multiple datacenters, the typical design is to have a local installation of Prometheus. This way you are only tunneling one pipe of traffic to/from each Prometheus server, and not exposing all targets to a remote server.

Frankly, the attack surface of nginx + OpenSSL is far greater than what an exporter target provides. Simple firewalling is generally enough. But if you really want to go down that route, I recommend using client cert auth and not basic auth. It's easier to generate and secure.

Prometheus is very easy to set up, and I'm happy to help with this. I already have Ansible roles to deal with it.
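For reference, client cert auth on the serving side amounts to a TLS context that refuses any peer without a certificate signed by a CA you control. A minimal sketch with Python's `ssl` module follows; the function name and the file-path parameters are hypothetical.

```python
import ssl


def make_client_cert_context(certfile=None, keyfile=None, client_ca=None):
    """Build a server-side TLS context that requires a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Reject the handshake unless the client presents a verifiable certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if certfile:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile, keyfile)
    if client_ca:
        # Only client certificates signed by this CA will be accepted.
        ctx.load_verify_locations(cafile=client_ca)
    return ctx
```

The resulting context can wrap a listening socket with `ctx.wrap_socket(sock, server_side=True)`. The scraping side then needs its own certificate and key signed by that CA.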

hellais commented 7 years ago

> Prometheus is very easy to set up, and I'm happy to help with this. I already have Ansible roles to deal with it.

Are you on Jabber? I am, as hellais@jabber.ccc.de. Edit: we also have an OONI IRC or Slack (https://slack.openobservatory.org/)

darkk commented 6 years ago

The basic deployment of Prometheus was done a long time ago, and we're quite happy with Prometheus. But lots of leftovers are still there, so further cleanup is needed and many more signals have to be scraped.

ooni / sysadmin

Monitoring of infrastructure with alerts #93