Grafana/Dashboarding Spike

lfrancke commented 2 years ago

We support collecting metrics via Prometheus, but we do not yet have any support for displaying them in dashboards. The go-to solution is Grafana, but there might be other options out there.

This ticket serves as a combination of a research & test spike with the following goals:

[x] Research dashboarding solutions besides Grafana (they need to be Open Source, OSI compatible)
[x] Set up Grafana (no matter the outcome of the previous point) and research availability of operators/helm charts etc.
[x] Build a dashboard for Airflow and Kafka to include some stats, the content doesn't really matter
[x] Export & import those dashboards to see how we can distribute those precreated dashboards easily
[x] Present your results to the team, it'd be good if everyone was aware of how this works

The points above serve as rough guidelines, but your own ideas are very welcome. Our goal is to have a charting/dashboarding solution that we can customize and ship in code. Grafana is the most popular one out there, so we should definitely have an answer there. However, it is licensed under the AGPL and some customers might not like that, so it would be good to see whether there are other options.

maltesander commented 2 years ago

I had a look at Grafana and Kibana in more detail. I dismissed products like DashBuilder, FreeBoard, Graphite etc. for lack of functionality, bad user experience or hidden "premium" models.

Criterion	Grafana (https://grafana.com/)	Kibana (https://www.elastic.co/de/kibana/)
Logs and Metrics	Designed to analyze and visualize metrics like CPU usage, memory etc.	Runs on top of Elasticsearch and is used primarily for analyzing log messages
Installation / Configuration	Easy to configure via .ini file, with overrides via ENV variables	Easy to configure via YAML files
Data sources	Many data sources like Graphite, Prometheus, InfluxDB, MySQL, PostgreSQL and Elasticsearch, and even more using plugins	An Elasticsearch instance is required and is the only possible data source. Data needs to be shipped into the ELK Stack (via Filebeat or Metricbeat, then Logstash, then Elasticsearch)
Access control & Authentication	Built-in user control and authentication mechanisms that allow you to restrict and control access to your dashboards. LDAP or external SQL servers possible.	Public access by default. Commercial solutions like X-Pack or open source ones like SearchGuard are available. LDAP supported.
Querying	Comfortable query editor for different data sources	Querying via Lucene syntax or the Elasticsearch Query DSL
Templating	Dashboards can be imported and exported via JSON, which makes it easy to customize and template	Dashboards can be imported and exported via JSON, which makes it easy to customize and template
Dashboards and UI	Versatile panels can visualize data from different data sources. Graph, singlestat, table, heatmap and freetext panel types are available and many dashboard templates exist	Many visualization possibilities like pie charts, line charts, data tables, single metric visualizations, geo maps, time series and markdown visualizations can all be combined into dashboards.
Alerts	Ships with a built-in alerting engine that allows users to attach conditional rules to dashboard panels, which trigger alerts to a notification endpoint of your choice (e.g. email, Slack, custom webhooks)	No out-of-the-box support, but you can opt for a hosted ELK Stack such as Logz.io, implement ElastAlert or use X-Pack.

I tested a little with the Prometheus Operator which already includes Grafana.

I created a ZooKeeper cluster:

./create_test_cluster.py --kind kind --operator zookeeper --debug --prometheus

and applied the simple examples (https://github.com/stackabletech/zookeeper-operator/tree/main/examples)

The result looks something like this (Pods):

NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-operator-kube-p-alertmanager-0    2/2     Running   0          23m
prometheus-operator-grafana-6c475f9867-r8f7s              3/3     Running   0          23m
prometheus-operator-kube-p-operator-654ccf58ff-dvpw7      1/1     Running   0          23m
prometheus-operator-kube-state-metrics-764767b8f5-cbv79   1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-92dqd        1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-dltg4        1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-lcbq8        1/1     Running   0          23m
prometheus-operator-prometheus-node-exporter-xdfkt        1/1     Running   0          23m
prometheus-prometheus-operator-kube-p-prometheus-0        2/2     Running   0          23m
simple-zk-server-primary-0                                1/1     Running   0          12m
simple-zk-server-primary-1                                1/1     Running   0          12m
simple-zk-server-secondary-0                              1/1     Running   0          12m
zookeeper-operator-deployment-59b79d9b47-n6mz2            1/1     Running   0          25m

And Services:

NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                          ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   25m
kubernetes                                     ClusterIP   10.96.0.1       <none>        443/TCP                      28m
prometheus-operated                            ClusterIP   None            <none>        9090/TCP                     25m
prometheus-operator-grafana                    ClusterIP   10.96.208.93    <none>        80/TCP                       26m
prometheus-operator-kube-p-alertmanager        ClusterIP   10.96.201.207   <none>        9093/TCP                     26m
prometheus-operator-kube-p-operator            ClusterIP   10.96.157.26    <none>        443/TCP                      26m
prometheus-operator-kube-p-prometheus          ClusterIP   10.96.58.76     <none>        9090/TCP                     26m
prometheus-operator-kube-state-metrics         ClusterIP   10.96.146.5     <none>        8080/TCP                     26m
prometheus-operator-prometheus-node-exporter   ClusterIP   10.96.253.210   <none>        9100/TCP                     26m
simple-zk                                      NodePort    10.96.21.171    <none>        2181:32166/TCP               14m
simple-zk-server-primary                       ClusterIP   None            <none>        2181/TCP,9505/TCP            14m
simple-zk-server-secondary                     ClusterIP   None            <none>        2181/TCP,9505/TCP            14m

Now you can check whether Prometheus discovers the ZooKeeper services correctly by port forwarding and checking the UI:

kubectl port-forward svc/prometheus-operator-kube-p-prometheus 9090

Check http://localhost:9090/service-discovery and http://localhost:9090/targets

and look at the content of serviceMonitor/default/scrape-label/0, which should include our 3 ZooKeeper pods.
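The same check can be scripted instead of clicking through the UI. Below is a minimal sketch that assumes the port forward from above is still running and uses the Prometheus HTTP API endpoint /api/v1/targets. The function names and the sample payload are made up for illustration, and the `pod` label is an assumption about how the targets are labelled in this setup.

```python
import json
from urllib.request import urlopen

# Scrape pool name as shown on the Prometheus targets page in this setup.
SCRAPE_POOL = "serviceMonitor/default/scrape-label/0"


def targets_in_pool(payload, scrape_pool=SCRAPE_POOL):
    """Return (pod, health) pairs for all active targets in one scrape pool."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        # Assumption: each target carries a "pod" label.
        (t.get("labels", {}).get("pod", "<unknown>"), t.get("health", "unknown"))
        for t in active
        if t.get("scrapePool") == scrape_pool
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running (port-forwarded) Prometheus."""
    with urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


# Hand-written sample payload (not real output) to show the filtering.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapePool": SCRAPE_POOL,
             "labels": {"pod": "simple-zk-server-primary-0"}, "health": "up"},
            {"scrapePool": SCRAPE_POOL,
             "labels": {"pod": "simple-zk-server-secondary-0"}, "health": "down"},
            {"scrapePool": "some-other-pool",
             "labels": {"pod": "unrelated"}, "health": "up"},
        ]
    },
}

zk_targets = targets_in_pool(sample_payload)
```

Against the live cluster, `targets_in_pool(fetch_targets())` should list the three ZooKeeper pods if discovery works as described.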

Now we need a port forward for the Grafana service: kubectl port-forward svc/prometheus-operator-grafana 11111:80

To log in, retrieve the admin user password: kubectl get secret prometheus-operator-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Then try http://localhost:11111 for the Grafana UI.

For me the login was admin:prom-operator.
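For anyone wondering why the base64 step is needed: Kubernetes stores Secret values base64-encoded in the `data` field. A small sketch of the same decoding in Python follows. The secret dict is hand-built to stand in for the output of `kubectl get secret ... -o json`, and the helper name is hypothetical.

```python
import base64


def decode_secret_field(secret, key):
    """Decode one base64-encoded entry from a Secret's `data` map."""
    return base64.b64decode(secret["data"][key]).decode("utf-8")


# Hand-built stand-in for the parsed Secret; only the field used above is shown.
sample_secret = {
    "kind": "Secret",
    "data": {
        "admin-password": base64.b64encode(b"prom-operator").decode("ascii"),
    },
}

admin_password = decode_secret_field(sample_secret, "admin-password")
```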

Then check that the Prometheus data source is already connected.

Then go and add a new panel.

Add some ZooKeeper metrics and save.

The dashboard should now be available under the dashboards.

If you look at the dashboard, you can export it (next to the header and the star).

This returns a JSON file:

{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": [],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "8.3.5"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "timeseries",
      "name": "Time series",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "exemplar": true,
          "expr": "zookeeper_PacketsSent",
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "hide": false,
          "refId": "B"
        }
      ],
      "title": "Panel Title",
      "type": "timeseries"
    }
  ],
  "refresh": "",
  "schemaVersion": 34,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Zookeeper",
  "uid": "nU-nyWBnk",
  "version": 2,
  "weekStart": ""
}

Now you can try to delete the ZooKeeper dashboard and reimport the created JSON file, picking the Prometheus data source again.

This should recreate the old ZooKeeper dashboard.
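If we want to ship such an export without the manual "pick the data source" step, the `${DS_PROMETHEUS}` placeholder has to be filled in beforehand. Here is a rough sketch of that substitution. It assumes the export has the `__inputs` layout shown above. The function names are hypothetical, and the data source uid is simply the one that appears in my local setup, so it will differ elsewhere.

```python
import copy


def _substitute(node, mapping):
    """Recursively replace placeholder strings in a JSON-like structure."""
    if isinstance(node, dict):
        return {k: _substitute(v, mapping) for k, v in node.items()}
    if isinstance(node, list):
        return [_substitute(v, mapping) for v in node]
    if isinstance(node, str):
        for placeholder, value in mapping.items():
            node = node.replace(placeholder, value)
    return node


def resolve_inputs(dashboard, values):
    """Return a copy of an exported dashboard with its __inputs filled in.

    `values` maps input names (e.g. "DS_PROMETHEUS") to concrete values.
    """
    resolved = copy.deepcopy(dashboard)
    declared = [entry["name"] for entry in resolved.pop("__inputs", [])]
    missing = [name for name in declared if name not in values]
    if missing:
        raise KeyError(f"no value given for inputs: {missing}")
    mapping = {"${" + name + "}": values[name] for name in declared}
    return _substitute(resolved, mapping)


# Trimmed-down version of the export above, just enough to show the idea.
exported = {
    "__inputs": [{"name": "DS_PROMETHEUS", "type": "datasource",
                  "pluginId": "prometheus"}],
    "panels": [{
        "targets": [{
            "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
            "expr": "zookeeper_PacketsSent",
            "refId": "A",
        }],
        "type": "timeseries",
    }],
    "title": "Zookeeper",
}

# Assumption: this uid is specific to one Grafana instance.
resolved = resolve_inputs(exported, {"DS_PROMETHEUS": "PBFA97CFB590B2093"})
```

The original export is left untouched, so the same file can be resolved against different data sources.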

maltesander commented 2 years ago

Using two Kafka clusters:

NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-operator-kube-p-alertmanager-0    2/2     Running   0          102m
kafka-operator-deployment-54899996ff-9p277                1/1     Running   0          16m
prometheus-operator-grafana-6c475f9867-r8f7s              3/3     Running   0          102m
prometheus-operator-kube-p-operator-654ccf58ff-dvpw7      1/1     Running   0          102m
prometheus-operator-kube-state-metrics-764767b8f5-cbv79   1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-92dqd        1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-dltg4        1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-lcbq8        1/1     Running   0          102m
prometheus-operator-prometheus-node-exporter-xdfkt        1/1     Running   0          102m
prometheus-prometheus-operator-kube-p-prometheus-0        2/2     Running   0          102m
simple-kafka-2-broker-default-0                           2/2     Running   0          12m
simple-kafka-2-broker-default-1                           2/2     Running   0          12m
simple-kafka-2-broker-default-2                           2/2     Running   0          12m
simple-kafka-broker-default-0                             2/2     Running   0          16m
simple-kafka-broker-default-1                             2/2     Running   0          16m
simple-kafka-broker-default-2                             2/2     Running   0          16m
simple-kafka-broker-default-3                             2/2     Running   0          16m
simple-kafka-broker-default-4                             2/2     Running   0          16m
simple-zk-server-default-0                                1/1     Running   0          16m
simple-zk-server-default-1                                1/1     Running   0          16m
simple-zk-server-default-2                                1/1     Running   0          16m
zookeeper-operator-deployment-59b79d9b47-n6mz2            1/1     Running   0          104m

We can select only one cluster via the Grafana metrics browser and use the resulting label selector for templating dashboards later on via JSON:

...
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "PBFA97CFB590B2093"
          },
          "exemplar": true,
          "expr": "kafka_network_requestmetrics_requestbytes_count{job=\"simple-kafka\"}",
          "hide": false,
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "title": "simple-kafka",
      "type": "timeseries"
    }
  ],
...
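To illustrate the templating idea, here is a minimal sketch that generates one panel per cluster by varying only the `job` label in the expression. It assumes the `job` label equals the cluster name (true for simple-kafka above, a guess for simple-kafka-2). The function names are hypothetical and the panel fields are a reduced subset of the export.

```python
import json


def target_for_job(metric, job, datasource_uid, ref_id="A"):
    """Build one Prometheus query target restricted to a single job label."""
    return {
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        "exemplar": True,
        "expr": f'{metric}{{job="{job}"}}',
        "hide": False,
        "interval": "",
        "legendFormat": "",
        "refId": ref_id,
    }


def panels_for_clusters(metric, jobs, datasource_uid):
    """Build one timeseries panel per cluster, stacked vertically."""
    panels = []
    for index, job in enumerate(jobs):
        panels.append({
            "id": index + 1,
            "gridPos": {"h": 9, "w": 12, "x": 0, "y": index * 9},
            "targets": [target_for_job(metric, job, datasource_uid)],
            "title": job,
            "type": "timeseries",
        })
    return panels


panels = panels_for_clusters(
    "kafka_network_requestmetrics_requestbytes_count",
    ["simple-kafka", "simple-kafka-2"],  # second job name is an assumption
    "PBFA97CFB590B2093",
)
panels_json = json.dumps(panels, indent=2)
```

The generated list can be dropped into the `panels` array of a dashboard JSON before importing it.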

maltesander commented 2 years ago

I checked a couple more tools (including Graphite and Graylog), and the most promising one is Netdata, with a GPL v3 license, Prometheus support etc.

I still prefer Grafana for now (and with @sbernauer as expert?) with the prometheus operator integration.

Any other proposals / opinions?

adwk67 commented 2 years ago

The only other tool that comes to mind is Chronograf from the InfluxDB stack: it requires an InfluxDB as a backend and can scrape Prometheus metrics directly. However, I can't see that it offers anything that Grafana doesn't, and it introduces an extra component (InfluxDB) which we would otherwise not need (unless perhaps for IIoT contexts).

maltesander commented 2 years ago

Yeah, except for Kibana (and the ES stack) I excluded everything that was based on other external tools like InfluxDB or was more directed at log monitoring. I think that would be a separate topic. I just checked Redash quickly, but Prometheus is not mentioned in the integrations (I did not dig any further for plugins etc.).

maltesander commented 2 years ago

We decided to pick Grafana for metrics aggregation for now.

stackabletech / issues

Grafana/Dashboarding Spike #176