litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a Cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Creating a Prometheus -> Grafana dashboard that measures chaos start/stop with Pod Delete #1472

Closed: cpitstick-argo closed this issue 4 years ago

cpitstick-argo commented 4 years ago

What happened: I'm trying to create a Grafana dashboard in which Litmus Kubernetes events, exported to Prometheus using the Heptio eventrouter (manifest & ServiceMonitor below), are overlaid on an example hello-world service under monitoring (manifest below). The idea is to get a very clear dashboard that shows when chaos runs and its impact on the availability of the service.

This generally works by watching ChaosInject, but I have not been able to get a stable "start/stop" indicator for the chaos experiment. Right now I'm using these PromQL queries as annotations on the Grafana panel:

Start: changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete-runner",reason="Started", involved_object_namespace="hello-world-alpha"}[1m])

End: changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete-runner",reason="Killing", involved_object_namespace="hello-world-alpha"}[1m])

The start event seems to fire pretty consistently, but the 'end' event ('Killing') appears only intermittently.

What you expected to happen:

Ideally, here is what I would want to see (this is an example of when it was working):

(Screenshot: Grafana panel, 2020-04-21 11:06 AM)

Green lines (annotations) are chaos start, red lines (annotations) are chaos stop, the green time-series is the blackbox monitor probe, and yellow is ChaosInject. You can see that while chaos is running the hello-world service is down, but as soon as chaos stops it comes back, until the CronJob kicks off the experiment again.

Of note: "ChaosEngineInitialized", "ChaosEngineCompleted", and "ChaosEngineStopped" seem to be far less reliable in this regard; their correlation with the chaos window is much weaker, so it is harder to build a graph like the one above from them. I do not understand what these events mean or how they are supposed to be used.

How to reproduce it (as minimally and precisely as possible):

I set this up on an EKS Kubernetes cluster running 1.15.10 that has the prometheus operator (5.2.0) exporting to Grafana (6.5.2).

Commands used to set this up:

Install the blackbox exporter: helm install prometheus-blackbox-exporter stable/prometheus-blackbox-exporter
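
The remaining pieces were applied from the manifests below. A rough sketch of the follow-up commands, assuming the rendered manifests are applied directly with kubectl (in the original setup they appear to come from Helm charts, so the file names here are just placeholders):

kubectl apply -f hello-world-test-application.yaml
kubectl apply -f pod-delete.yaml
kubectl apply -f blackbox-service-monitor.yaml
kubectl apply -f eventrouter.yaml

Sanity checks (the ServiceMonitors should be picked up and the CronJob scheduled):

kubectl get servicemonitors -n monitoring
kubectl get servicemonitors -n kube-system
kubectl get cronjobs -n hello-world-alpha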

Grafana Panel JSON:

{
  "datasource": "[datasource]",
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "fill": 1,
  "fillGradient": 0,
  "gridPos": {
    "h": 6,
    "w": 24,
    "x": 0,
    "y": 0
  },
  "hiddenSeries": false,
  "id": 6,
  "legend": {
    "avg": false,
    "current": false,
    "max": false,
    "min": false,
    "show": true,
    "total": false,
    "values": false
  },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": {
    "dataLinks": []
  },
  "percentage": false,
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "{__name__=\"probe_success\",job=\"hello-world-alpha-prometheus-blackbox-exporter\"}",
      "refId": "A"
    },
    {
      "expr": "increase(heptio_eventrouter_normal_total{reason=\"ChaosInject\", involved_object_name=\"chaos-engine-pod-delete\", involved_object_namespace=\"hello-world-alpha\", involved_object_kind=\"ChaosEngine\"}[1m])",
      "refId": "B"
    }
  ],
  "thresholds": [],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "Hello World Alpha Chaos (Pod Delete)",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}

Relevant Kubernetes manifests:

Service under test:

---
# Source: hello-world/templates/hello-world-test-application.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: hello-world-alpha
---
# Source: hello-world/templates/hello-world-test-application.yaml
kind: Service
apiVersion: v1
metadata:
  name: hello-world-alpha-loadbalancer
  namespace: hello-world-alpha
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
spec:
  selector:
    app: hello-world-alpha-application
  ports:
    - protocol: TCP
      name: web
      port: 80
      targetPort: 8080
  type: LoadBalancer
---
# Source: hello-world/templates/hello-world-test-application.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world-alpha-service
  namespace: hello-world-alpha
  labels:
    app: hello-world-alpha-application
  annotations:
    litmuschaos.io/chaos: "true"
spec:
  selector:
    matchLabels:
      app: hello-world-alpha-application
  replicas: 1
  template:
    metadata:
      labels:
        app: hello-world-alpha-application
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - hello-world-alpha-application
      initContainers:
        - name: hello-world-delay
          image: alpine:3.11
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh"]
          args: ["-c", "sleep 10s"]
      containers:
        - name: hello-world
          image: gcr.io/google-samples/node-hello:1.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              protocol: TCP
          readinessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Pod Delete CronJob manifest:

---
# Source: chaos_manifests/templates/pod-delete.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-delete-manifest
  namespace: hello-world-alpha
data:
  pod-delete-manifest.yaml: |
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: chaos-engine-pod-delete
      namespace: hello-world-alpha
    spec:
      appinfo:
        appns: hello-world-alpha
        applabel: app=hello-world-alpha-application
        appkind: deployment
      # It can be true/false
      annotationCheck: "true"
      # It can be active/stop
      engineState: "active"
      #ex. values: ns1:name=percona,ns2:run=nginx
      auxiliaryAppInfo: ""
      chaosServiceAccount: hello-world-chaos-admin
      monitoring: false
      # It can be delete/retain
      jobCleanUpPolicy: "delete" # "retain" for debugging. "delete" for production.
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                # set chaos duration (in sec) as desired
                - name: TOTAL_CHAOS_DURATION
                  value: "160"
                # set chaos interval (in sec) as desired
                - name: CHAOS_INTERVAL
                  value: "20"
                # pod failures without --force & default terminationGracePeriodSeconds
                - name: FORCE
                  value: "false"
---
# Source: chaos_manifests/templates/pod-delete.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: pod-delete-litmus-chaos-cron
  namespace: hello-world-alpha
spec:
  schedule: "*/7 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hello-world-chaos-admin
          initContainers:
            - name: pod-delete-remove-chaos-engine
              image: [image url]
              imagePullPolicy: IfNotPresent
              args:
                - "delete"
                - "ChaosEngine"
                - "chaos-engine-pod-delete"
                - "-n"
                - "hello-world-alpha"
                - "--ignore-not-found=true"
            - name: pod-delete-remove-chaos-engine-runner-events
              image: [image url]
              imagePullPolicy: IfNotPresent
              args:
                - "delete"
                - "events"
                - "--field-selector"
                - "involvedObject.name='chaos-engine-pod-delete-runner'"
                - "-n"
                - "hello-world-alpha"
            - name: pod-delete-remove-chaos-engine-events
              image: [image url]
              imagePullPolicy: IfNotPresent
              args:
                - "delete"
                - "events"
                - "--field-selector"
                - "involvedObject.name='chaos-engine-pod-delete'"
                - "-n"
                - "hello-world-alpha"
          containers:
            - name: pod-delete-litmus-chaos-cron
              image: [image url]
              imagePullPolicy: IfNotPresent
              args:
                - "apply"
                - "-f"
                - "/manifest/pod-delete-manifest.yaml"
              volumeMounts:
                - mountPath: /manifest
                  name: manifest-temp
          restartPolicy: OnFailure
          volumes:
            - name: manifest-temp
              configMap:
                name: pod-delete-manifest
                defaultMode: 0777
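
(For reference: with TOTAL_CHAOS_DURATION=160 and CHAOS_INTERVAL=20, and assuming the pod-delete experiment kills one round of pods per interval, each CronJob run injects roughly 160/20 = 8 pod deletions over about 2.7 minutes, leaving roughly 4 minutes of recovery time before the next */7 schedule fires.)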

Blackbox Monitor Manifest:

---
# Source: chaos_manifests/templates/blackbox-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  generation: 1
  labels:
    app.kubernetes.io/instance: prometheus-blackbox-exporter
    app.kubernetes.io/name: prometheus-blackbox-exporter
    release: prometheus-operator
  name: hello-world-alpha-prometheus-blackbox-exporter
  namespace: monitoring
spec:
  endpoints:
    - interval: 5s
      metricRelabelings:
        - sourceLabels:
            - __address__
          targetLabel: __param_target
        - sourceLabels:
            - __param_target
          targetLabel: instance
        - replacement: [endpoint]
          targetLabel: target
        - replacement: hello-world-alpha-prometheus-blackbox-exporter
          targetLabel: job
      params:
        module:
          - http_2xx
        target:
          - [endpoint]
      path: /probe
      port: http
      scheme: http
      scrapeTimeout: 5s
  jobLabel: prometheus-blackbox-exporter
  namespaceSelector:
    matchNames:
      - monitoring
      - default
  selector:
    matchLabels:
      app.kubernetes.io/instance: prometheus-blackbox-exporter
      app.kubernetes.io/name: prometheus-blackbox-exporter

Eventrouter manifest:

# Copyright 2017 Heptio Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: eventrouter
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: eventrouter
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: eventrouter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eventrouter
subjects:
  - kind: ServiceAccount
    name: eventrouter
    namespace: kube-system
---
apiVersion: v1
data:
  config.json: |-
    {
      "sink": "glog"
    }
kind: ConfigMap
metadata:
  name: eventrouter-cm
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventrouter
  namespace: kube-system
  labels:
    app: eventrouter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eventrouter
  template:
    metadata:
      labels:
        app: eventrouter
        tier: control-plane-addons
    spec:
      containers:
        - name: kube-eventrouter
          image: gcr.io/heptio-images/eventrouter:latest
          imagePullPolicy: IfNotPresent
          ports:
          - containerPort: 8080
            name: metrics
            protocol: TCP
          volumeMounts:
            - name: config-volume
              mountPath: /etc/eventrouter
      serviceAccount: eventrouter
      volumes:
        - name: config-volume
          configMap:
            name: eventrouter-cm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: eventrouter-prometheus
  name: eventrouter-prometheus
  namespace: kube-system
spec:
  ports:
  - name: metrics
    port: 8080
    protocol: TCP
    targetPort: metrics
  selector:
    app: eventrouter
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: eventrouter-prometheus
    release: prometheus-operator
  name: eventrouter-prometheus
  namespace: kube-system
spec:
  endpoints:
    - interval: 10s
      path: /metrics
      port: metrics
  selector:
    matchLabels:
      app: eventrouter-prometheus

ksatchit commented 4 years ago

Thank you for this really detailed issue @cpitstick-argo !! We shall try this setup & get back.

ksatchit commented 4 years ago

Start: changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete",reason="ExperimentJobCreate", involved_object_namespace="hello-world-alpha"}[1m])

End: changes(heptio_eventrouter_normal_total{involved_object_name="chaos-engine-pod-delete",reason="ExperimentJobCleanUp", involved_object_namespace="hello-world-alpha"}[1m])
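
For reference, a minimal sketch of how these two queries could be registered as annotation queries in the dashboard JSON (Grafana 6.x style; "[datasource]" is a placeholder and the exact annotation field set may vary by Grafana version):

{
  "annotations": {
    "list": [
      {
        "datasource": "[datasource]",
        "enable": true,
        "expr": "changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ExperimentJobCreate\", involved_object_namespace=\"hello-world-alpha\"}[1m]) > 0",
        "iconColor": "#56A64B",
        "name": "Chaos start",
        "showIn": 0,
        "step": "30s",
        "titleFormat": "Chaos start"
      },
      {
        "datasource": "[datasource]",
        "enable": true,
        "expr": "changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ExperimentJobCleanUp\", involved_object_namespace=\"hello-world-alpha\"}[1m]) > 0",
        "iconColor": "#E02F44",
        "name": "Chaos stop",
        "showIn": 0,
        "step": "30s",
        "titleFormat": "Chaos stop"
      }
    ]
  }
}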

cpitstick-argo commented 4 years ago

One other panel I neglected to show was this:

{
  "datasource": "[datasource]",
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "fill": 1,
  "fillGradient": 0,
  "gridPos": {
    "h": 6,
    "w": 24,
    "x": 0,
    "y": 6
  },
  "hiddenSeries": false,
  "id": 5,
  "legend": {
    "avg": false,
    "current": false,
    "max": false,
    "min": false,
    "show": true,
    "total": false,
    "values": false
  },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": {
    "dataLinks": []
  },
  "percentage": false,
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ExperimentJobCreate\", involved_object_namespace=\"hello-world-alpha\"}[1m])",
      "instant": false,
      "legendFormat": "Start",
      "refId": "A"
    },
    {
      "expr": "changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"PostChaosCheck\", involved_object_namespace=\"hello-world-alpha\"}[1m])",
      "instant": false,
      "legendFormat": "Stop",
      "refId": "B"
    }
  ],
  "thresholds": [],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "Hello Alpha Chaos Start/Stop (Pod Delete)",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}

ksatchit commented 4 years ago

Thanks for the info @cpitstick-argo ! Just to summarize the discussion we had:

Having said that, there are a few things that need a deeper look:

umamukkara commented 4 years ago

@rajdas98 - it would be useful to have your inputs recorded here, based on the recent chaos Grafana dashboard you wrote.

cpitstick-argo commented 4 years ago

I'm now getting an almost perfect cadence, except for weird blips that appear every ~3 runs of the chaos experiment. These blips aren't just anomalous ExperimentJobCreate and PostChaosCheck events; they make it look as though the experiment is still running long after everything has terminated, since they occur in the "dead time" between runs of the experiment.

Here's the image:

(Screenshot: 2020-04-23, 1:02 PM)

cpitstick-argo commented 4 years ago

I'm almost certain I have now figured out the "ghost metrics." It's not a problem with Litmus.

It's an issue with the Prometheus integration of eventrouter. I dug into the logs from both Litmus and eventrouter: eventrouter is (re)processing event updates that arrive on a regular cadence, but the events it exports on each "update" are identical to the original event. The blips are "ghosts" because they are literally identical repeats of metrics that have already been recorded.

Containership maintained a fork of eventrouter once upon a time, and they actually fixed this issue in their fork:

https://github.com/containership/eventrouter/commit/f62fe77a43bf06fd846acb73ee4d473186a8ef5b

I haven't yet been able to test this (I'll get to it next week), but I'm very confident that this is one of the main reasons I saw these issues in the first place.

Several possible actions stem from this:

  1. Litmus needs its own event exporter, similar to eventrouter, tailored to these use-cases.
  2. Litmus can contribute to what looks to be the most promising long-term solution for this: https://github.com/opsgenie/kubernetes-event-exporter (but it doesn't have a Prometheus exporter right now, and I don't know that I would have the time to add it...)
  3. Litmus should just export directly to Prometheus and skip this step.
  4. All of the above? Something else?
ksatchit commented 4 years ago

Great! Thanks for the update @cpitstick-argo ! Option (3) is what we are working towards in the near term. I suppose the next step (we should share this shortly) is an identical set of dashboards built on the chaos-exporter metrics. Having said that, we are finding the event routers to be pretty nifty with respect to how we treat & persist events, so we will circle back on that at some point.
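
In the meantime, the eventrouter series already in this thread can be combined into a single "availability while chaos is active" expression. A minimal PromQL sketch (label values taken from the queries earlier in this thread):

min_over_time(probe_success{job="hello-world-alpha-prometheus-blackbox-exporter"}[1m]) and on() (changes(heptio_eventrouter_normal_total{reason="ChaosInject", involved_object_name="chaos-engine-pod-delete", involved_object_namespace="hello-world-alpha", involved_object_kind="ChaosEngine"}[1m]) > 0)

This returns the worst probe result over the last minute, but only during windows in which at least one ChaosInject event was recorded, which makes "service down during chaos" panels easier to read.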

cpitstick-argo commented 4 years ago

I just finished testing this and can confirm it worked. I applied the patch to a local fork of eventrouter and used it. No more ghost metrics!

I was also able to go back to using ChaosEngineInitialized and ChaosEngineCompleted:

{
  "datasource": "[Source]",
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "fill": 1,
  "fillGradient": 0,
  "gridPos": {
    "h": 6,
    "w": 24,
    "x": 0,
    "y": 0
  },
  "hiddenSeries": false,
  "id": 8,
  "legend": {
    "avg": false,
    "current": false,
    "max": false,
    "min": false,
    "show": true,
    "total": false,
    "values": false
  },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": {
    "dataLinks": []
  },
  "percentage": false,
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [
    {
      "alias": "Service Availability",
      "color": "#F2CC0C"
    },
    {
      "alias": "Chaos Start",
      "bars": true,
      "color": "#56A64B"
    },
    {
      "alias": "Chaos End",
      "bars": true,
      "color": "#E02F44"
    },
    {
      "alias": "Chaos Inject (Pod Delete)",
      "color": "#3274D9"
    }
  ],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "{__name__=\"probe_success\",job=\"hello-world-alpha-prometheus-blackbox-exporter\"}",
      "legendFormat": "Service Availability",
      "refId": "A"
    },
    {
      "expr": "changes(heptio_eventrouter_normal_total{reason=\"ChaosInject\", involved_object_name=\"chaos-engine-pod-delete\", involved_object_namespace=\"hello-world-alpha\", involved_object_kind=\"ChaosEngine\"}[1m])",
      "format": "time_series",
      "legendFormat": "Chaos Inject (Pod Delete)",
      "refId": "B"
    },
    {
      "expr": "(changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ChaosEngineInitialized\", involved_object_namespace=\"hello-world-alpha\"}[30s]) > 0) * 4",
      "legendFormat": "Chaos Start",
      "refId": "C"
    },
    {
      "expr": "(changes(heptio_eventrouter_normal_total{involved_object_name=\"chaos-engine-pod-delete\",reason=\"ChaosEngineCompleted\", involved_object_namespace=\"hello-world-alpha\"}[30s]) > 0) * 4",
      "legendFormat": "Chaos End",
      "refId": "D"
    }
  ],
  "thresholds": [],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "Hello World Alpha Chaos (Pod Delete)",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "short",
      "label": "Frequency",
      "logBase": 1,
      "max": 4,
      "min": 0,
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": 4,
      "min": 0,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}

(Screenshot: 2020-04-25, 9:06 PM)

cpitstick-argo commented 4 years ago

I've thought a little more about this, and I think that Litmus' strategy of exporting notifications as Kubernetes events is a sound one. Just take a look at all the places this (https://github.com/opsgenie/kubernetes-event-exporter) exports to, with many more planned. Litmus cannot and should not try to solve that problem; it's huge and never-ending.

Instead, I do think the way things are is better. You export events (with tunable verbosity and maybe some other knobs and dials), and then point to supported integrations like opsgenie's kubernetes-event-exporter or eventrouter that forward the events to wherever they're needed for solid monitoring.

Maybe a default Prometheus exporter is fine, or a limited set of default exporters as a way to show that monitoring is possible. Beyond that, the Litmus framework shouldn't bite off more than it can chew here.

Jasstkn commented 4 years ago

@cpitstick-argo hi! I'm trying to do the same thing as you. Could you clarify how you managed to get Kubernetes events into Grafana?

cpitstick-argo commented 4 years ago
  1. First, be on Litmus 1.2.0 or greater (preferably the latest version; things change rapidly, so make sure you have the latest manifests).
  2. I did not use admin mode. I installed everything into the same namespace as the app (the manifest for the hello-world app is listed above).
  3. You'll need to build your own custom copy of eventrouter with the containership change mentioned above (a rough build sketch follows this list). The eventrouter manifest and the ServiceMonitor you need to attach to it are above.
  4. I used blackbox to monitor the HTTP endpoint, and I believe the manifest is above.
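
A rough sketch of step 3, assuming the containership fork builds with the standard Docker workflow (the image name and registry below are placeholders):

git clone https://github.com/containership/eventrouter.git
cd eventrouter
docker build -t <registry>/eventrouter:patched .
docker push <registry>/eventrouter:patched

Then point the image field of the eventrouter Deployment above (gcr.io/heptio-images/eventrouter:latest) at the pushed image.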

What else do you need?

Jasstkn commented 4 years ago

I've achieved approximately the same result as you.

(Screenshot: 2020-04-29, 09:28)

Jasstkn commented 4 years ago

Going to take a look!

ksatchit commented 4 years ago

This issue was originally created to track inconsistencies in the events indicating the start/stop of chaos, which were eventually fixed via a stable fork of the original integration (the Heptio eventrouter). Refer to https://github.com/litmuschaos/litmus/issues/1472#issuecomment-619462834.

The current thought process is to continue building in more events and experiment metrics so that existing tools in the ecosystem can consume them for custom visualization. Litmus also plans to provide more dashboard examples for standard/demo applications, which the community can modify to suit their own needs.

Closing this issue based on the above.