Closed: lfrancke closed this issue 2 years ago.
I had a look at Grafana and Kibana in more detail. I dismissed products like DashBuilder, FreeBoard, Graphite, etc. for lack of functionality, bad user experience, or hidden "premium" models.
| | Grafana (https://grafana.com/) | Kibana (https://www.elastic.co/de/kibana/) |
|---|---|---|
| Logs and metrics | Designed to analyze and visualize metrics like CPU usage, memory, etc. | Runs on top of Elasticsearch and is used primarily for analyzing log messages |
| Installation / configuration | Easy to configure via an `.ini` file, with overrides via environment variables | Easy to configure via YAML files |
| Data sources | Many data sources like Graphite, Prometheus, InfluxDB, MySQL, PostgreSQL and Elasticsearch, and even more using plugins | An Elasticsearch instance is required and is the only possible data source. Data needs to be shipped into the ELK stack (via Filebeat or Metricbeat, then Logstash, then Elasticsearch) |
| Access control & authentication | Built-in user control and authentication mechanisms that allow you to restrict and control access to your dashboards. LDAP or external SQL servers are possible. | Public access by default. Commercial solutions like X-Pack or open-source ones like SearchGuard are available. LDAP is supported. |
| Querying | Comfortable query editor for the different data sources | Querying via Lucene syntax or the Elasticsearch Query DSL |
| Templating | Dashboards can be imported and exported as JSON, which makes them easy to customize and template | Dashboards can be imported and exported as JSON, which makes them easy to customize and template |
| Dashboards and UI | Versatile panels can visualize data from different data sources. Graph, singlestat, table, heatmap and free-text panel types are available, and many dashboard templates exist | Many visualization types like pie charts, line charts, data tables, single-metric visualizations, geo maps, time series and markdown visualizations can all be combined into dashboards |
| Alerts | Ships with a built-in alerting engine that allows users to attach conditional rules to dashboard panels, resulting in triggered alerts to a notification endpoint of your choice (e.g. email, Slack, custom webhooks) | No out-of-the-box support, but you can opt for a hosted ELK stack such as Logz.io, implement ElastAlert, or use X-Pack |
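As a side note on the Grafana configuration row: Grafana documents a naming convention for overriding any `.ini` setting through an environment variable, `GF_<SECTION>_<KEY>` in upper case. A tiny sketch of that rule (the helper function is my own, not part of Grafana):

```python
# Sketch of Grafana's documented override rule: any grafana.ini setting can be
# overridden by an environment variable named GF_<SECTION>_<KEY> (upper-cased,
# dots replaced by underscores). grafana_env_var is a hypothetical helper.
def grafana_env_var(section: str, key: str) -> str:
    """Build the environment variable name that overrides an .ini setting."""
    return "GF_{}_{}".format(
        section.upper().replace(".", "_"),
        key.upper().replace(".", "_"),
    )

# e.g. the [security] admin_password setting in grafana.ini:
print(grafana_env_var("security", "admin_password"))  # GF_SECURITY_ADMIN_PASSWORD
```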
I tested a bit with the Prometheus Operator, which already includes Grafana.
I created a ZooKeeper cluster:
```
./create_test_cluster.py --kind kind --operator zookeeper --debug --prometheus
```
and applied the simple examples (https://github.com/stackabletech/zookeeper-operator/tree/main/examples).
The result looks something like this (pods):
```
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-operator-kube-p-alertmanager-0   2/2     Running   0          23m
prometheus-operator-grafana-6c475f9867-r8f7s             3/3     Running   0          23m
prometheus-operator-kube-p-operator-654ccf58ff-dvpw7     1/1     Running   0          23m
prometheus-operator-kube-state-metrics-764767b8f5-cbv79  1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-92dqd       1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-dltg4       1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-lcbq8       1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-xdfkt       1/1     Running   0          23m
prometheus-prometheus-operator-kube-p-prometheus-0       2/2     Running   0          23m
simple-zk-server-primary-0                               1/1     Running   0          12m
simple-zk-server-primary-1                               1/1     Running   0          12m
simple-zk-server-secondary-0                             1/1     Running   0          12m
zookeeper-operator-deployment-59b79d9b47-n6mz2           1/1     Running   0          25m
```
And Services:
```
NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                          ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   25m
kubernetes                                     ClusterIP   10.96.0.1       <none>        443/TCP                      28m
prometheus-operated                            ClusterIP   None            <none>        9090/TCP                     25m
prometheus-operator-grafana                    ClusterIP   10.96.208.93    <none>        80/TCP                       26m
prometheus-operator-kube-p-alertmanager        ClusterIP   10.96.201.207   <none>        9093/TCP                     26m
prometheus-operator-kube-p-operator            ClusterIP   10.96.157.26    <none>        443/TCP                      26m
prometheus-operator-kube-p-prometheus          ClusterIP   10.96.58.76     <none>        9090/TCP                     26m
prometheus-operator-kube-state-metrics         ClusterIP   10.96.146.5     <none>        8080/TCP                     26m
prometheus-operator-prometheus-node-exporter   ClusterIP   10.96.253.210   <none>        9100/TCP                     26m
simple-zk                                      NodePort    10.96.21.171    <none>        2181:32166/TCP               14m
simple-zk-server-primary                       ClusterIP   None            <none>        2181/TCP,9505/TCP            14m
simple-zk-server-secondary                     ClusterIP   None            <none>        2181/TCP,9505/TCP            14m
```
Now you can check whether Prometheus discovers the ZooKeeper services correctly by port-forwarding and checking the UI:
```
kubectl port-forward svc/prometheus-operator-kube-p-prometheus 9090
```
Check http://localhost:9090/service-discovery and http://localhost:9090/targets,
and inspect the content of `serviceMonitor/default/scrape-label/0`,
which should include our 3 ZooKeeper pods:
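If you prefer checking the targets programmatically instead of via the UI, the same information is available from the Prometheus HTTP API at `/api/v1/targets`. A small Python sketch that parses a response shaped like that API's output (the `job` and `pod` label values below are illustrative placeholders for this setup, not captured output):

```python
import json

# Hypothetical excerpt of what http://localhost:9090/api/v1/targets returns;
# the real response carries many more fields per target.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "simple-zk", "pod": "simple-zk-server-primary-0"}, "health": "up"},
      {"labels": {"job": "simple-zk", "pod": "simple-zk-server-primary-1"}, "health": "up"},
      {"labels": {"job": "simple-zk", "pod": "simple-zk-server-secondary-0"}, "health": "up"}
    ]
  }
}
""")

# Collect the pods behind the ZooKeeper job and check that they are all healthy.
zk_targets = [
    t for t in sample["data"]["activeTargets"]
    if t["labels"]["job"] == "simple-zk"
]
assert all(t["health"] == "up" for t in zk_targets)
print(sorted(t["labels"]["pod"] for t in zk_targets))
```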
Now we need a port forward for the Grafana service:
```
kubectl port-forward svc/prometheus-operator-grafana 11111:80
```
To log in, retrieve the admin user's password:
```
kubectl get secret prometheus-operator-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
```
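For context: Kubernetes stores Secret values base64-encoded, which is why the command pipes the `jsonpath` output through `base64 --decode`. The same decoding step, sketched in Python (the value here is illustrative, not a real secret):

```python
import base64

# A Secret's .data fields are base64-encoded strings; kubectl extracts the
# raw field and we decode it ourselves. Illustrative value, not a real secret.
encoded = base64.b64encode(b"prom-operator").decode("ascii")  # as stored in the Secret
password = base64.b64decode(encoded).decode("utf-8")
print(password)  # prom-operator
```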
Then open http://localhost:11111 for the Grafana UI.
For me the login was admin:prom-operator.
Then check if the Prometheus data source is already connected:
Then go and add a new panel:
Add some zookeeper metrics and save:
Now it should be available under the dashboards:
If you look at the dashboard you can export it (next to the header and the star) like this:
This will return a JSON file:
```json
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": [],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "8.3.5"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "timeseries",
      "name": "Time series",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "exemplar": true,
          "expr": "zookeeper_PacketsSent",
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "hide": false,
          "refId": "B"
        }
      ],
      "title": "Panel Title",
      "type": "timeseries"
    }
  ],
  "refresh": "",
  "schemaVersion": 34,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Zookeeper",
  "uid": "nU-nyWBnk",
  "version": 2,
  "weekStart": ""
}
```
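Note the `__inputs` section in the export: the panels reference the data source only through the `${DS_PROMETHEUS}` placeholder, and on import Grafana asks you to bind that input to a concrete data source. If we later want to ship dashboards in code, that binding could also be done programmatically. A simplified sketch (plain string substitution over the dashboard JSON; `bind_datasource` is a hypothetical helper, not a Grafana API):

```python
import json

# Minimal stand-in for the exported dashboard above (only the relevant parts).
exported = {
    "__inputs": [{"name": "DS_PROMETHEUS", "type": "datasource", "pluginId": "prometheus"}],
    "panels": [{"targets": [{"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}}]}],
}

def bind_datasource(dashboard: dict, input_name: str, concrete_uid: str) -> dict:
    """Replace every ${<input_name>} placeholder with a concrete datasource uid."""
    text = json.dumps(dashboard).replace("${%s}" % input_name, concrete_uid)
    bound = json.loads(text)
    bound.pop("__inputs", None)  # no longer needed once the input is resolved
    return bound

bound = bind_datasource(exported, "DS_PROMETHEUS", "PBFA97CFB590B2093")
print(bound["panels"][0]["targets"][0]["datasource"]["uid"])  # PBFA97CFB590B2093
```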
Now you can try deleting the ZooKeeper dashboard and reimporting the created JSON file, picking the Prometheus data source again:
This should recreate the old ZooKeeper dashboard.
Using two Kafka clusters:
```
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-operator-kube-p-alertmanager-0   2/2     Running   0          102m
kafka-operator-deployment-54899996ff-9p277               1/1     Running   0          16m
prometheus-operator-grafana-6c475f9867-r8f7s             3/3     Running   0          102m
prometheus-operator-kube-p-operator-654ccf58ff-dvpw7     1/1     Running   0          102m
prometheus-operator-kube-state-metrics-764767b8f5-cbv79  1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-92dqd       1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-dltg4       1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-lcbq8       1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-xdfkt       1/1     Running   0          102m
prometheus-prometheus-operator-kube-p-prometheus-0       2/2     Running   0          102m
simple-kafka-2-broker-default-0                          2/2     Running   0          12m
simple-kafka-2-broker-default-1                          2/2     Running   0          12m
simple-kafka-2-broker-default-2                          2/2     Running   0          12m
simple-kafka-broker-default-0                            2/2     Running   0          16m
simple-kafka-broker-default-1                            2/2     Running   0          16m
simple-kafka-broker-default-2                            2/2     Running   0          16m
simple-kafka-broker-default-3                            2/2     Running   0          16m
simple-kafka-broker-default-4                            2/2     Running   0          16m
simple-zk-server-default-0                               1/1     Running   0          16m
simple-zk-server-default-1                               1/1     Running   0          16m
simple-zk-server-default-2                               1/1     Running   0          16m
zookeeper-operator-deployment-59b79d9b47-n6mz2           1/1     Running   0          104m
```
We can select only one cluster via the Grafana metrics browser:
which results in:
And use that later on for templating dashboards via JSON:
```json
...
    },
    "targets": [
      {
        "datasource": {
          "type": "prometheus",
          "uid": "PBFA97CFB590B2093"
        },
        "exemplar": true,
        "expr": "kafka_network_requestmetrics_requestbytes_count{job=\"simple-kafka\"}",
        "hide": false,
        "interval": "",
        "legendFormat": "",
        "refId": "A"
      }
    ],
    "title": "simple-kafka",
    "type": "timeseries"
  }
],
...
```
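The per-cluster restriction above is just a PromQL label matcher on the `job` label. If we generate such panels in code later, the expression can be built from the metric and cluster name (a trivial sketch; `cluster_query` is a hypothetical helper):

```python
# Build the PromQL selector that restricts a metric to one Kafka cluster via
# its `job` label, as in the exported panel fragment above.
def cluster_query(metric: str, job: str) -> str:
    return '%s{job="%s"}' % (metric, job)

print(cluster_query("kafka_network_requestmetrics_requestbytes_count", "simple-kafka"))
# kafka_network_requestmetrics_requestbytes_count{job="simple-kafka"}
```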
I checked a couple more tools (including Graphite and Graylog); the most promising is Netdata, with a GPL v3 license, Prometheus support, etc.
I still prefer Grafana for now (and with @sbernauer as expert?), together with the Prometheus Operator integration.
Any other proposals / opinions?
The only other tool that comes to mind is Chronograf from the InfluxDB stack: it requires an InfluxDB as a backend and can scrape Prometheus metrics directly. However, I can't see that it offers anything that Grafana doesn't, and it introduces an extra component (InfluxDB) which we would otherwise not need (unless perhaps for IIoT contexts).
Yeah, except for Kibana (and the ES stack) I excluded everything that was based on other external tools like InfluxDB or was geared more toward log monitoring. I think that would be a separate topic. I just checked Redash quickly, but Prometheus is not mentioned in its integrations (I did not dig any further for plugins etc.).
We decided to pick Grafana for metrics aggregation for now.
We do support collecting metrics via Prometheus, but we do not yet have any support for displaying those in dashboards. The go-to solution is Grafana, but there might be other options out there.
This ticket serves as a combination of a research & test spike with the following goals:
The points above serve as rough guidelines, but your own ideas are very welcome. Our goal is to have a charting/dashboarding solution that we can customize and ship in code. Grafana is the most popular one out there, so we should definitely have an answer there, but it's AGPL and some customers might not like that, so it'd be good to see if there are other options out there.