ksatchit opened this issue 3 years ago
You can either choose to directly install the latest chaos-operator (1.13.5) in the desired cluster, OR set up the litmus portal control plane, with the operator getting installed as part of the agent registration process (2.0.0-beta7).
Case-1: Chaos-Operator Setup
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.5.yaml
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml
Or, to install the operator via the Helm chart:
kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace=litmus
Case-2: Litmus 2.0.0-Beta(7) Setup
kubectl apply -f https://litmuschaos.github.io/litmus/2.0.0-Beta/litmus-2.0.0-Beta.yaml
Or, to install the portal via the Helm chart:
kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus-portal litmuschaos/litmus-2-0-0-beta --namespace litmus --devel
Verify that the litmus chaos operator (and control plane components, in case of 2.0.0-Beta) are up and running.
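A quick way to check is to list the pods in the litmus namespace (standard kubectl; adjust the namespace if you installed elsewhere):
kubectl get pods -n litmus
Once these are healthy, install the kube-prometheus-stack in a dedicated monitoring namespace: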
kubectl create ns monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom prometheus-community/kube-prometheus-stack --namespace monitoring
Verify that the prometheus operator, the prometheus statefulset instance, the alertmanager statefulset instance, and the grafana deployment are installed and running (you will see the node-exporter daemonset and the kube-state-metrics deployment as well).
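As before, a simple pod listing in the monitoring namespace confirms this:
kubectl get pods -n monitoring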
Either edit the services of prometheus, alertmanager and grafana instances to use NodePort/LoadBalancer OR perform a port-forward operation to see the respective dashboards.
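For example, port-forwards along these lines expose the three UIs locally (a sketch; the service names assume the prom Helm release used above, so verify them with kubectl get svc -n monitoring):
kubectl port-forward svc/prom-kube-prometheus-stack-prometheus 9090:9090 -n monitoring
kubectl port-forward svc/prom-kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
kubectl port-forward svc/prom-grafana 3000:80 -n monitoring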
Install the Litmus chaos-exporter. (In case 2.0.0-Beta was set up in the earlier step, the chaos-exporter is installed automatically as part of the agent registration process; in that case, edit that deployment to add the TSDB_SCRAPE_INTERVAL environment variable shown in the manifest below.)
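If you are on the 2.0.0-Beta path and only need to add the variable to the already-running deployment, kubectl set env is one option (a sketch; it assumes the deployment is named chaos-exporter and runs in the litmus namespace):
kubectl set env deployment/chaos-exporter TSDB_SCRAPE_INTERVAL=30 -n litmus
For the 1.13.5 path, apply the full manifest: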
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-exporter
  template:
    metadata:
      labels:
        app: chaos-exporter
    spec:
      serviceAccountName: litmus
      containers:
      - image: litmuschaos/chaos-exporter:1.13.5
        imagePullPolicy: Always
        name: chaos-exporter
        env:
        - name: TSDB_SCRAPE_INTERVAL
          value: "30"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  ports:
  - port: 8080
    name: tcp
    protocol: TCP
    targetPort: 8080
  selector:
    app: chaos-exporter
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: chaos-exporter
    name: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  endpoints:
  - interval: 1s
    port: tcp
  jobLabel: name
  namespaceSelector:
    matchNames:
    - litmus
  selector:
    matchLabels:
      app: chaos-exporter
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
data:
  blackbox.yaml: |
    modules:
      http_2xx:
        http:
          no_follow_redirects: false
          preferred_ip_protocol: ip4
          valid_http_versions:
          - HTTP/1.1
          - HTTP/2
          valid_status_codes: []
        prober: http
        timeout: 5s
---
kind: Service
apiVersion: v1
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 9115
    protocol: TCP
  selector:
    app: prometheus-blackbox-exporter
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter
  template:
    metadata:
      labels:
        app: prometheus-blackbox-exporter
    spec:
      restartPolicy: Always
      containers:
      - name: blackbox-exporter
        image: "prom/blackbox-exporter:v0.15.1"
        imagePullPolicy: IfNotPresent
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        args:
        - "--config.file=/config/blackbox.yaml"
        resources: {}
        ports:
        - containerPort: 9115
          name: http
        livenessProbe:
          httpGet:
            path: /health
            port: http
        readinessProbe:
          httpGet:
            path: /health
            port: http
        volumeMounts:
        - mountPath: /config
          name: config
      - name: configmap-reload
        image: "jimmidyson/configmap-reload:v0.2.2"
        imagePullPolicy: "IfNotPresent"
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9115/-/reload
        resources: {}
        volumeMounts:
        - mountPath: /etc/config
          name: config
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: prometheus-blackbox-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    name: prometheus-blackbox-exporter
    k8s-app: prometheus-blackbox-exporter
  name: prometheus-blackbox-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 1s
    path: /probe
    port: http
    params:
      module:
      - http_2xx
      target:
      - "helloservice.demospace.svc.cluster.local:9000"
    metricRelabelings:
    - action: replace
      regex: (.*)
      replacement: my_local_service
      sourceLabels:
      - __param_target
      targetLabel: target
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter
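Apply the exporter manifests above; a sketch, with illustrative filenames standing in for wherever you saved the YAML:
kubectl apply -f chaos-exporter.yaml
kubectl apply -f blackbox-exporter.yaml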
Perform a kubectl apply of the following YAML to update the default Prometheus CR created on the cluster at the time of the kube-prometheus-stack install, so that it also selects the chaos-exporter and blackbox-exporter servicemonitors. (Note that the CR spec below carries another flag, evaluationInterval: 10s, apart from the servicemonitor addition; its utility will be explained in later steps.)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-prometheus
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: prom-kube-prometheus-stack-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: web
  enableAdminAPI: false
  evaluationInterval: 10s
  externalUrl: http://prom-kube-prometheus-stack-prometheus.monitoring:9090
  image: quay.io/prometheus/prometheus:v2.27.1
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prom
  portName: web
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: prom
  replicas: 1
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: prom
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prom-kube-prometheus-stack-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchExpressions:
    - key: k8s-app
      operator: In
      values:
      - chaos-exporter
      - prometheus-blackbox-exporter
  shards: 1
  version: v2.27.1
At this point only the litmuschaos_cluster* metrics appear (with values set to 0), since there are no chaosengines/chaosresults in the system; newer metrics are created once experiment runs occur. Next, create a PrometheusRule so that an alert (LitmusExpFailureAlert) is fired when an experiment fails:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
    prometheus-operator-validated: "true"
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-alertmanager.rules
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: LitmusExpFailureAlert
      annotations:
        message: |
          Chaos test {{ $labels.chaosengine_name }} has failed in namespace {{ $labels.chaosresult_namespace }} with probe success percentage {{ $labels.probe_success_percentage }}
      expr: litmuschaos_experiment_verdict{chaosresult_verdict="Fail",endpoint="tcp",job="chaos-monitor",service="chaos-monitor"} > 0
      labels:
        severity: critical
Set up incoming webhooks for your Slack workspace to integrate with the Alertmanager. For this, you need the Slack API URL.
Go to Slack -> Administration -> Manage apps.
In the Manage apps directory, search for Incoming WebHooks and add it to your Slack workspace.
Create the alert-manager configuration specification; in the <redacted> field below, place your Slack API URL:
global:
  resolve_timeout: 1m
receivers:
- name: slack-notifications
  slack_configs:
  - api_url: <redacted>
    channel: '#litmus-alerts'
    icon_url: https://raw.githubusercontent.com/litmuschaos/icons/master/litmus.png
    title: 'LitmusChaos Monitoring Event Notification'
    text: >-
      {{ range .Alerts }}
        *Description:* {{ .Annotations.message }}
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
route:
  receiver: slack-notifications
  group_by: ['alertname']
  group_wait: 15s
  group_interval: 30s
  routes:
  - receiver: 'slack-notifications'
    match:
      severity: slack
templates:
- /etc/alertmanager/config/*.tmpl
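Optionally, validate the configuration before injecting it. A quick sketch, assuming the above is saved as alert-configuration.yaml (the filename used in the next step) and that the amtool binary is available locally:
amtool check-config alert-configuration.yaml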
Update this new configuration in the alert-manager by injecting it into the alert-manager statefulset instance.
First, encode the configuration into base64:
cat alert-configuration.yaml | base64 -w0
Then edit the secret to replace the existing content of the alertmanager.yaml key under .data with the string obtained in the previous step:
kubectl edit secret alertmanager-prom-kube-prometheus-stack-alertmanager -n monitoring
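As an alternative to hand-editing the base64 blob, the secret can be regenerated from the file directly; a sketch, assuming the secret key is alertmanager.yaml and the config file is alert-configuration.yaml:
kubectl -n monitoring create secret generic alertmanager-prom-kube-prometheus-stack-alertmanager --from-file=alertmanager.yaml=alert-configuration.yaml --dry-run=client -o yaml | kubectl apply -f -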
Verify that the appropriate configuration is reflected in the alert-manager console (this may take a few minutes).
Login to the Grafana dashboard (the default credentials are admin/prom-operator; edit the secret prom-grafana to set a desired username/password).
Add the Prometheus data source in Grafana (an entry for this data source may already exist from the kube-prometheus-stack install):
Create a sample dashboard to track the availability and access latency of the helloservice. You can use Prometheus' avg_over_time function to sample these values over, say, a 1m interval.
avg_over_time(probe_success{job="prometheus-blackbox-exporter", namespace="monitoring"}[60s:1s])*100
avg_over_time(probe_duration_seconds{job="prometheus-blackbox-exporter", namespace="monitoring"}[60s:1s])*1000
Under normal circumstances, these evaluate to 100 (probe success percentage, indicating the availability of the service) and roughly 15-18ms (the approximate access latency of the helloservice).
Add an annotation based on the litmuschaos_awaited_experiments metric to view the behaviour/deviation of these metrics under chaotic conditions, resulting in a "chaos-interleaved" app dashboard. The dashboard JSON used here is:
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
},
{
"datasource": "Prometheus",
"enable": true,
"expr": "litmuschaos_awaited_experiments{chaosresult_namespace=\"litmus\",endpoint=\"tcp\",job=\"chaos-monitor\",namespace=\"litmus\",service=\"chaos-monitor\"}",
"hide": false,
"iconColor": "#C4162A",
"name": "Show Chaos Period",
"showIn": 0,
"step": "5s"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 25,
"links": [],
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": null,
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "avg_over_time(probe_success{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*100",
"interval": "",
"legendFormat": "Probe Success percentage",
"refId": "A"
}
],
"thresholds": [
{
"$$hashKey": "object:172",
"colorMode": "critical",
"fill": true,
"line": true,
"op": "lt",
"value": 95,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Probe Success Percentage",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"$$hashKey": "object:147",
"format": "short",
"label": null,
"logBase": 1,
"max": "100",
"min": "0",
"show": true
},
{
"$$hashKey": "object:148",
"format": "short",
"label": null,
"logBase": 1,
"max": "1",
"min": "0",
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": null,
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"hiddenSeries": false,
"id": 4,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "avg_over_time(probe_duration_seconds{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*1000",
"interval": "",
"legendFormat": "Service Access Latency",
"refId": "A"
}
],
"thresholds": [
{
"$$hashKey": "object:262",
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt",
"value": 20,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Access Duration (in ms)",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"$$hashKey": "object:218",
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"$$hashKey": "object:219",
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"schemaVersion": 27,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "PodtatoHead-BlackBox-Exporter",
"uid": "V8yDu66Gk",
"version": 2
}
Now let us trigger a chaos experiment, say pod-delete, against the podtato-head helloservice to view the impact of chaos on the Grafana dashboard. We will also add some steady-state hypothesis constraints around the availability of this service to the experiment run using Litmus probes.
You may have noted by now that the helloservice has a single replica; deleting it causes a temporary loss of access to the service, which makes the probe fail and hence the experiment fail. The idea is to deliberately cause a failure of the experiment and visualize the Slack alert/notification based on the rule we created earlier in this exercise.
To do this: (a) install the pod-delete ChaosExperiment CR in the litmus namespace, and (b) prepare a ChaosEngine with the probe specifications.
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.13.5?file=charts/generic/pod-delete/experiment.yaml -n litmus
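To confirm the experiment CR is installed, a standard listing works (no assumptions beyond the litmus namespace used above):
kubectl get chaosexperiments -n litmus
Next, the ChaosEngine with the probe specification: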
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-chaos
  namespace: litmus
spec:
  appinfo:
    appns: 'demospace'
    applabel: 'app=helloservice'
    appkind: 'deployment'
  annotationCheck: 'false'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  monitoring: false
  jobCleanUpPolicy: 'retain'
  experiments:
  - name: pod-delete
    spec:
      probe:
      - name: http-probe
        type: "httpProbe"
        httpProbe/inputs:
          url: "http://104.154.133.35:31798"
          insecureSkipVerify: false
          method:
            get:
              criteria: "=="
              responseCode: "200"
        mode: "Continuous"
        runProperties:
          probeTimeout: 1
          interval: 1
          retry: 1
          probePollingInterval: 1
      components:
        env:
        # set chaos duration (in sec) as desired
        - name: TOTAL_CHAOS_DURATION
          value: '10'
        # set chaos interval (in sec) as desired
        - name: CHAOS_INTERVAL
          value: '10'
        # pod failures without '--force' & default terminationGracePeriodSeconds
        - name: FORCE
          value: 'false'
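Apply the engine to start the run. A minimal sketch, assuming the ChaosEngine above was saved locally as chaosengine.yaml (the filename is illustrative):
kubectl apply -f chaosengine.yaml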
Applying this ChaosEngine results in the creation of the chaos pods in the litmus namespace and the termination of the helloservice pod in the demospace namespace.
kubectl get pods -n litmus
NAME READY STATUS RESTARTS AGE
chaos-monitor-758c6b7f7c-vwxhw 1/1 Running 0 25h
chaos-operator-ce-5ffd8d8c8b-6hx7j 1/1 Running 0 2d23h
hello-chaos-runner 1/1 Running 0 8s
pod-delete-n2e1yq-g2q9q 1/1 Running 0 6s
kubectl get pods -n demospace
NAME READY STATUS RESTARTS AGE
helloservice-79869dd7f6-jbmn8 0/1 Terminating 0 20m
helloservice-79869dd7f6-z7ctn 1/1 Running 0 2s
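Once the run completes, the experiment verdict is recorded on the chaosresult CR. By Litmus convention the resource is named <chaosengine>-<experiment>, so here it should be hello-chaos-pod-delete; treat that name as an assumption and list the chaosresults first if unsure:
kubectl get chaosresults -n litmus
kubectl describe chaosresult hello-chaos-pod-delete -n litmus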
There are a few things to take care of/note when performing the setup of the observability stack described in this exercise.
Ensure that the labels and port-names referred to in servicemonitors match with those in the kubernetes service resources for a given component.
The correct combination of the following properties is critical for successful reception of the Slack alerts/notifications:
- TSDB_SCRAPE_INTERVAL in the chaos-exporter deployment
- spec.endpoints.interval defined for the chaos-exporter's servicemonitor (essentially the scrape interval for the service)
- the for directive in the PrometheusRule for the litmus alert (best kept at 0s OR skipped altogether, as in our example)
- evaluationInterval in the prometheus CR
- the group_wait period within the alertmanager configuration (alert-configuration.yaml)
The explanation/reason for this is provided below:
The litmuschaos_experiment_verdict metric, which provides info on a given experiment's pass/fail status, is transient in nature and lasts only for the TSDB_SCRAPE_INTERVAL period (if this env is not specified, it assumes the value of the chaos-exporter's scrape interval by prometheus). In our case this is 30s, i.e., litmuschaos_experiment_verdict{chaosresult_verdict="Fail"} will stay set to 1 for 30s once the experiment fails.
The reason for this metric's transience is that the chaosresult CR contributing to it will/can enter other verdict states (awaited/stopped/pass/fail, etc.) if the experiment is re-executed from the same chaosengine. In other words, it is not cumulative. Since the metric lives for a specific period, the evaluationInterval of prometheus (the polling period for checking whether an alert expression is satisfied) should be kept lower than the TSDB_SCRAPE_INTERVAL; in our case it is set to 10s.
Once Prometheus evaluates the alert rule to be true, it looks at the for directive within the PrometheusRule/alert rule to see how long the expr/condition must remain true before the alert actually fires. While this holds good for other system conditions (say, CPU utilization or load average), in our case a failure alert must be fired immediately. Therefore it is better to keep the directive at for: 0s or skip it altogether.
Once all these conditions are met, the alert is fired from Prometheus and passed to the alert-manager. The alert-manager then waits for the group_wait period, checking whether any more alerts arrive within the defined group (in our case we group by alertname), before collating and sending the Slack notification. We want this period to be low, and the same (instance of the) alert to stay alive during the wait, so we have selected a value (15s) lower than the TSDB_SCRAPE_INTERVAL.
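Because litmuschaos_experiment_verdict is transient, it can help to confirm by hand that the alert expression currently returns a series. A minimal check against the Prometheus HTTP API, assuming Prometheus is reachable on localhost:9090 (e.g., via the port-forward shown earlier):
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=litmuschaos_experiment_verdict{chaosresult_verdict="Fail"} > 0'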
You can play around with these values in a way that makes sense for your overall environment/other services, while ensuring the combination you select works within the principles explained above.
There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!
Introduction
This issue contains the steps for setting up the scrape of litmus metrics by prometheus and instrumenting application dashboards on grafana with these metrics (as annotations). We will also cover the steps needed to receive slack notifications/alerts based on litmus metrics (esp. for chaos experiment failures).
Application Setup
Let us launch a simple application to test out the chaos observability stack being setup. In this case, I've used CNCF's podtato-head hello service for ease-of-use and illustration purposes.
- Apply manifest to run the hello-server deployment
kubectl apply -f https://raw.githubusercontent.com/cncf/podtato-head/main/delivery/manifest/manifest.yaml
- Obtain the LoadBalancer IP from the service (change to other service types like NodePort or do a port-forward as appropriate) and view the hello service app on the browser
@ksatchit
FYI, since podtato-head is used as the sample in this guide (within the issue), it should be noted that the sample Kubernetes manifest is not available in the main branch.
You may use the following GitHub release tag from the repository, to install the sample application.
kubectl apply -f https://raw.githubusercontent.com/cncf/podtato-head/release-0.1.0/delivery/manifest/manifest.yaml
There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!
Hi, do you have any example for it?
There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!
Hi, do you have any example for it?
The sock shop Grafana dashboard has sample alerts set up which can be used as a reference:
https://github.com/litmuschaos/litmus/blob/master/monitoring/grafana-dashboards/sock-shop/README.md
https://docs.litmuschaos.io/docs/integrations/grafana#fault-injection-and-system-failure-alerts