uschtwill/docker_monitoring_logging_alerting

+++ 2018/06/16 - Updated to ELK 6.3.0 and fixed a whole bunch of stuff... +++

Please let me know if something is broken so I can fix it!

If you have any feedback regarding this monitoring/logging/alerting suite, any ideas for improvement, fixes, questions or comments, please feel free to contact me or do a PR!

What is this?

Blog post on Medium with some more elaboration.

This is a secure out-of-the-box monitoring, logging and alerting suite for Docker-hosts and their containers, complete with dashboards to monitor and explore your host and container logs and metrics.

Monitoring: cAdvisor and node_exporter for collection, Prometheus for storage, Grafana for visualisation.

Logging: Filebeat for log collection and forwarding, Logstash for aggregation and processing, Elasticsearch as datastore/backend and Kibana as the frontend.

Alerting: elastalert as a drop-in for Elastic.io's Watcher for alerts triggered by certain container or host log events and Prometheus' Alertmanager for alerts regarding metrics.

Security: The whole suite can be run in secure mode, which places jwilder's nginx reverse proxy (with JrCs letsencrypt companion) in front of the suite. This reverse proxy then handles all traffic to and from the suite, forces https, fully automates initial SSL certificate issuance and renewal, provides basic auth for all dashboards and lets you forgo any port forwarding from the suite containers to the host machine.

Of course you can then also use this nginx reverse proxy with the exact same mechanism to manage traffic to and from your other containers like applications, databases, api endpoints and what have you.

grafana_screenshot

The Grafana dashboard (a bit slimmed down) can also be found on grafana.net: https://grafana.net/dashboards/395.

kibana_screenshot

alerts_screenshot

How to set it up?

This repository comes with a storage directory for Grafana that contains the configuration for the data sources and the dashboard. This directory will be mounted into the containers as volumes. This is for your convenience and eliminates some manual setup steps.

+++

Note: With the update to ELK 6.3.0 the indices from the initial commits of the repository were causing errors. I thus removed them, so unfortunately there are no convenience dashboards for Kibana anymore.

This also means that you will have to set up the indices yourself in the beginning. But that will be an easy exercise, the new Kibana does a good job in assisting with that.

Also, give the whole stack a bit of time to start up. I noticed that it takes up to 3 minutes for the first logs to arrive and for Kibana to suggest index patterns.

+++

git clone this repository: git clone https://github.com/uschtwill/docker_monitoring_logging_alerting.git
cd into the folder: cd docker_monitoring_logging_alerting
Check out the prerequisites in install-prerequisites.sh and make sure they're fulfilled (or just run the script if the host is a fresh machine).
Run the setup script setup.sh.

For `secure` mode run `sh setup.sh secure YOUR_DOMAIN VERY_STRONG_PASSWORD`

Actually, before running the script, quickly create subdomain A-record DNS entries for grafana.DOMAIN, kibana.DOMAIN, prometheus.DOMAIN and alertmanager.DOMAIN that point at the host that is going to run the suite (DOMAIN being your domain).
Provided with your domain and a very strong password, sh setup.sh secure YOUR_DOMAIN VERY_STRONG_PASSWORD will set up the suite in secure mode, effectively:
- running it with an nginx reverse proxy in front of it.
- cutting out all port-forwarding nonsense.
- downloading SSL certificates and keeping them up to date.
- providing basic auth for all dashboards and locking them down with HTTPS(-only).
- exposing dashboards at https://grafana.DOMAIN, https://kibana.DOMAIN, https://prometheus.DOMAIN and https://alertmanager.DOMAIN.
Run any containers with the same logging options as defined in this suite's docker-compose.yml and add a container_group label to enable monitoring, logging and alerting for them.
If you want to uninstall this suite completely, you can revert to the state before setting up by running the cleanup script: sh cleanup.sh secure.
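Before running the secure setup, it can help to confirm that the four DNS records mentioned above already resolve. The following is a minimal sketch and not part of this repository: `suite_hostnames` and `check_suite_dns` are hypothetical helper names, and `getent` is assumed to be available on the host (it is on most Linux systems).

```shell
# Hypothetical helper: print the four subdomains the secure setup expects.
suite_hostnames() {
  domain="$1"
  for sub in grafana kibana prometheus alertmanager; do
    printf '%s.%s\n' "$sub" "$domain"
  done
}

# Hypothetical helper: report which of those names resolve from this machine.
check_suite_dns() {
  for host in $(suite_hostnames "$1"); do
    if getent hosts "$host" >/dev/null 2>&1; then
      echo "ok       $host"
    else
      echo "missing  $host"
    fi
  done
}
```

Calling `check_suite_dns YOUR_DOMAIN` prints one line per subdomain. Anything marked `missing` probably still needs its A record, or a bit more time to propagate.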

For `unsecure` mode run `sh setup.sh unsecure`.

Enjoy and explore your logs and metrics:
- To explore your logs: localhost:5601/app/kibana#/discover.
- To explore your logging metrics: localhost:5601/app/kibana#/dashboard/Exploration.
- To see your most important container and host metrics at a glance: localhost:3000/dashboard/db/main-overview.
- To explore any metric that's collected without having to build queries: localhost:3000/dashboard/db/data-exploration.
- To see all monitoring alerts and their status in prometheus: localhost:9090/alerts.
- To manage your monitoring alerts (e.g. silence them) in Alertmanager: localhost:9093/#/alerts. Elastalert (logging alerts) unfortunately does not have a frontend.
- Just to see what the cAdvisor frontend looks like (you'll use Grafana for looking at monitoring metrics anyways): localhost:8080/containers/
- To say hello to your Elasticsearch instance: curl localhost:9200.
Run any containers with the same logging options as defined in this suite's docker-compose.yml and add a container_group label to enable monitoring, logging and alerting for them.
When you're done testing this suite, you can revert to the state before setting up by running the cleanup script: sh cleanup.sh unsecure.
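For a container that is started outside of this suite's docker-compose.yml, the same logging options can be passed on the command line. This is a minimal sketch: `run_with_suite_logging` is a hypothetical wrapper, and the GELF address is the one from the logging example in this README, so check docker-compose.yml for the value your setup actually uses.

```shell
# Hypothetical wrapper: start a container with the suite's GELF logging
# options and a container_group label.
# Assumption: the gelf-address is copied from the logging example in this
# README; adjust it if your docker-compose.yml uses a different one.
run_with_suite_logging() {
  group="$1"
  shift
  docker run -d \
    --log-driver gelf \
    --log-opt gelf-address=udp://172.16.0.38:12201 \
    --log-opt labels=container_group \
    --label container_group="$group" \
    "$@"
}
```

For example, `run_with_suite_logging my_app nginx:alpine` would start an nginx container whose logs are tagged with the group `my_app` (both names are placeholders).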

For debugging: In case you would like certain containers to log to stdout because you're having trouble with ELK or simply because it feels more natural to you, you can simply comment out the logging options for individual containers. Logs of those containers will go to stdout while the logs for all other containers will continue to go to Logstash.

#    logging:
#      driver: gelf
#      options:
#        gelf-address: udp://172.16.0.38:12201
#        labels: container_group

Alerting and Annotations in Grafana

Notice: Since v4.0, Grafana also does alerting - with quite a nice GUI. I haven't tried it yet myself, but I encourage you to look into it: http://docs.grafana.org/guides/whats-new-in-v4/.

This suite uses elastalert and Alertmanager for alerting. Rules for logging alerts (elastalert) go into ./elastalert/rules/ and rules for monitoring alerts (Alertmanager) go into ./prometheus/rules/. Alertmanager only takes care of the communications part of the monitoring alerts; the rules themselves are defined "in" Prometheus.

Both Alertmanager and elastalert can be configured to send their alerts to various outputs. In this suite, Logstash and Slack are set up. The integration with Logstash works out of the box; for Slack you will need to insert your webhook URL.

The alerts that are sent to Logstash can be checked by looking at the 'logstash-alerts' index in Kibana. Apart from functioning as a first output, sending and storing the alerts to Elasticsearch via Logstash is also neat because it allows us to query them from Grafana and have them imported to its Dashboards as annotations.

annotations_screenshot
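The stored alerts can also be inspected without Kibana by asking Elasticsearch directly. A minimal sketch, assuming unsecure mode with Elasticsearch reachable on localhost:9200; `recent_alerts` is a hypothetical helper, and the trailing wildcard on the index name is an assumption in case the index carries a suffix.

```shell
# Hypothetical helper: fetch a few documents from the alerts index.
# Assumptions: Elasticsearch listens on localhost:9200 (unsecure mode) and
# the index name starts with "logstash-alerts".
recent_alerts() {
  count="${1:-5}"
  curl -s "localhost:9200/logstash-alerts*/_search?size=${count}&pretty"
}
```

`recent_alerts 3` would return up to three alert documents as pretty-printed JSON. An empty hits list suggests that no alert has been forwarded to Logstash yet.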

The monitoring alerting rules, which are stored in the Prometheus directory, contain a fake alert that should be firing from the beginning and demonstrates the concept. Find it and comment it out to have some peace. Logging alerts should start coming in soon as well: this suite by itself already consists of 10 containers, and something is always complaining. Of course you can also force things by breaking stuff yourself - the blanket_log-level_catch.yaml rule that's already set up should catch it.

If you're annoyed by non-events repeatedly triggering alerts, throw them in ./logstash/config/31-non-events.conf in order for Logstash to silence them by overwriting their log_level upon import.

Grafana/Prometheus Query Building

Unfortunately Grafana doesn't appear to have a fancy query builder for Prometheus as it has for Graphite or InfluxDB; instead, one has to type out one's queries by hand.

So when building Grafana graphs/dashboards with Prometheus as the data source, knowing its query DSL and metric types is important. This also means that documentation about using Grafana with InfluxDB won't help you much, further narrowing down the number of available resources. This is kind of unfortunate.

Here you can find the official documentation for Prometheus on both the query DSL and the metric types:

Information on Prometheus Querying

Information on Prometheus Metric Types
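To try out a query before typing it into a Grafana panel, it can be sent to Prometheus' HTTP API directly. A minimal sketch, assuming unsecure mode with Prometheus on localhost:9090; `prom_query` and the `PROM_URL` variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: run an instant PromQL query against Prometheus.
# Assumption: Prometheus is reachable at localhost:9090 unless PROM_URL
# (a variable invented for this sketch) points somewhere else.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

`prom_query 'up'` should list the scrape targets and whether they are reachable. Something like `prom_query 'rate(container_cpu_usage_seconds_total[5m])'` would show per-container CPU usage, assuming cAdvisor exposes that counter in your version.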

Furthermore, since I couldn't find proper documentation on the metrics cAdvisor and Prometheus/Node-Exporter expose, I decided to just take the info from the /metrics endpoints and bring it into a human-readable format.

Check them here. Combining the information on the exposed metrics themselves with that on Prometheus' query DSL and metric types, you should be good to go to build some beautiful dashboards yourself.
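If the lists go stale, a similar overview can be pulled from a running exporter. This is a rough sketch and not necessarily how the linked lists were produced: `metric_help` is a hypothetical helper that keeps only the `# HELP` lines of the Prometheus text format and prints them as `name: description`.

```shell
# Hypothetical helper: read Prometheus exposition text on stdin and print
# one "metric_name: help text" line for every "# HELP" line.
metric_help() {
  awk '/^# HELP / {
    name = $3
    $1 = $2 = $3 = ""      # drop "#", "HELP" and the metric name
    sub(/^ +/, "")         # trim the leading blanks left behind
    print name ": " $0
  }'
}
```

In unsecure mode, `curl -s localhost:8080/metrics | metric_help` would list what cAdvisor exposes, using the cAdvisor port mentioned above.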

Known Issues

Bad umask: If your umask is restrictive (that is, not something like 0022), cloning the repository can create files and folders with permissions that are too tight. Some containers do not start up when that is the case, e.g. Kibana can't read its config. Setting the umask before downloading the git repo fixes this issue. (pointed out by @riemers)
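A minimal sketch of that fix, to be run in the shell session you are going to clone from:

```shell
# With a umask of 0022, new files default to 0644 and new directories to
# 0755 for the rest of this shell session.
umask 0022

# Print the active value to confirm; clone the repository afterwards, e.g.
#   git clone https://github.com/uschtwill/docker_monitoring_logging_alerting.git
umask
```

The setting only lasts for the current shell session, so it does not change your system-wide default.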