litmuschaos / tutorials

Consists of codelabs to perform standard user flows for LitmusChaos
Apache License 2.0

Steps to Visualize Litmus Metrics and Generate Notifications #6

Open ksatchit opened 3 years ago

ksatchit commented 3 years ago

Introduction

This issue contains the steps for setting up Prometheus to scrape Litmus metrics and for instrumenting application dashboards on Grafana with these metrics (as annotations). We will also cover the steps needed to receive Slack notifications/alerts based on Litmus metrics (especially for chaos experiment failures).

Application Setup

Let us launch a simple application to test out the chaos observability stack being set up. In this case, I've used CNCF's podtato-head hello service for ease of use and illustration purposes.

  • Apply the manifest to run the hello-server deployment:

kubectl apply -f https://raw.githubusercontent.com/cncf/podtato-head/main/delivery/manifest/manifest.yaml

  • Obtain the LoadBalancer IP from the service (change to other service types like NodePort or do a port-forward as appropriate) and view the hello service app in the browser.
ksatchit commented 3 years ago

LitmusChaos Infra Setup

You can either install the latest chaos-operator (1.13.5) directly in the desired cluster, OR set up the Litmus portal control plane, with the operator installed as part of the agent registration process (2.0.0-Beta7).

Case-1: Chaos-Operator Setup

Install via the manifests:

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.5.yaml
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml

Alternatively, install via the Helm chart:

kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace=litmus

Case-2: Litmus 2.0.0-Beta(7) Setup

Install via the manifest:

kubectl apply -f https://litmuschaos.github.io/litmus/2.0.0-Beta/litmus-2.0.0-Beta.yaml

Alternatively, install via the Helm chart:

kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus-portal litmuschaos/litmus-2-0-0-beta --namespace litmus --devel

Verify that the litmus chaos operator (and the control plane components, in the case of 2.0.0-Beta) are up and running.

ksatchit commented 3 years ago

Monitoring Infra Setup

kubectl create ns monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom prometheus-community/kube-prometheus-stack --namespace monitoring

Deploy the chaos-exporter, along with a Service and a ServiceMonitor, so that Prometheus can scrape the Litmus metrics:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-exporter
  template:
    metadata:
      labels:
        app: chaos-exporter
    spec:
      serviceAccountName: litmus
      containers:
        - image: litmuschaos/chaos-exporter:1.13.5
          imagePullPolicy: Always
          name: chaos-exporter
          env:
            - name: TSDB_SCRAPE_INTERVAL
              value: "30"      
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  ports:
    - port: 8080
      name: tcp
      protocol: TCP
      targetPort: 8080
  selector:
    app: chaos-exporter
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: chaos-exporter
    name: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  endpoints:
  - interval: 1s
    port: tcp
  jobLabel: name
  namespaceSelector:
    matchNames:
    - litmus
  selector:
    matchLabels:
      app: chaos-exporter
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
data:
  blackbox.yaml: |
    modules:
      http_2xx:
        http:
          no_follow_redirects: false
          preferred_ip_protocol: ip4
          valid_http_versions:
          - HTTP/1.1
          - HTTP/2
          valid_status_codes: []
        prober: http
        timeout: 5s
---
kind: Service
apiVersion: v1
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 9115
      protocol: TCP
  selector:
    app: prometheus-blackbox-exporter
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter
  template:
    metadata:
      labels:
        app: prometheus-blackbox-exporter
    spec:
      restartPolicy: Always
      containers:
        - name: blackbox-exporter
          image: "prom/blackbox-exporter:v0.15.1"
          imagePullPolicy: IfNotPresent
          securityContext:
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
          args:
            - "--config.file=/config/blackbox.yaml"
          resources:
            {}
          ports:
            - containerPort: 9115
              name: http
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /health
              port: http
          volumeMounts:
            - mountPath: /config
              name: config
        - name: configmap-reload
          image: "jimmidyson/configmap-reload:v0.2.2"
          imagePullPolicy: "IfNotPresent"
          securityContext:
            runAsNonRoot: true
            runAsUser: 65534
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9115/-/reload
          resources:
            {}
          volumeMounts:
            - mountPath: /etc/config
              name: config
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: prometheus-blackbox-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    name: prometheus-blackbox-exporter
    k8s-app: prometheus-blackbox-exporter
  name: prometheus-blackbox-exporter
  namespace: monitoring
spec:
  endpoints:
    - interval: 1s
      path: /probe
      port: http
      params:
        module:
        - http_2xx
        target:
        - "helloservice.demospace.svc.cluster.local:9000"
      metricRelabelings:
      - action: replace
        regex: (.*)
        replacement: my_local_service
        sourceLabels:
        - __param_target
        targetLabel: target
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter

Update the Prometheus custom resource created by the kube-prometheus-stack chart so that its serviceMonitorSelector picks up both the chaos-exporter and the blackbox-exporter ServiceMonitors:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-prometheus
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: prom-kube-prometheus-stack-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: web
  enableAdminAPI: false
  evaluationInterval: 10s
  externalUrl: http://prom-kube-prometheus-stack-prometheus.monitoring:9090
  image: quay.io/prometheus/prometheus:v2.27.1
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prom
  portName: web
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: prom
  replicas: 1
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: prom
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prom-kube-prometheus-stack-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchExpressions:
    - key: k8s-app
      operator: In
      values:
      - chaos-exporter
      - prometheus-blackbox-exporter
  shards: 1
  version: v2.27.1
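
The blackbox exporter's http_2xx module above issues an HTTP GET, follows redirects (no_follow_redirects: false), and treats any 2xx response within the 5s timeout as a successful probe (an empty valid_status_codes list means "any 2xx"). As a rough sketch under those assumptions (hypothetical, stdlib-only Python; the real exporter also validates HTTP versions and IP protocol), the check amounts to:

```python
from typing import Optional
from urllib import error, request


def is_probe_success(status: Optional[int]) -> bool:
    """valid_status_codes: [] in the module config means any 2xx counts."""
    return status is not None and 200 <= status < 300


def probe_http_2xx(url: str, timeout: float = 5.0) -> bool:
    """Simplified model of the blackbox exporter's http_2xx module.

    Issues an HTTP GET (redirects are followed, matching
    no_follow_redirects: false) and reports success for any 2xx
    response received within `timeout` seconds.
    """
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return is_probe_success(resp.status)
    except (error.URLError, OSError, ValueError):
        return False
```

The exporter exposes this result as the probe_success metric (1 or 0), which the dashboard and alerts below build upon.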


ksatchit commented 3 years ago

Alerting Configuration

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
    prometheus-operator-validated: "true"
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-alertmanager.rules
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: LitmusExpFailureAlert
      annotations:
        message: |
          Chaos test {{ $labels.chaosengine_name }} has failed in namespace {{ $labels.chaosresult_namespace }} with probe success percentage {{ $labels.probe_success_percentage }}
      expr: litmuschaos_experiment_verdict{chaosresult_verdict="Fail",endpoint="tcp",job="chaos-monitor",service="chaos-monitor"}
        > 0
      labels:
        severity: critical

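The rule above fires for every litmuschaos_experiment_verdict series with chaosresult_verdict="Fail" whose value is greater than 0, and the message annotation is templated from that series' labels. A rough Python model of the evaluation (illustrative only; the actual PromQL evaluation is done by Prometheus):

```python
def evaluate_litmus_failure_alert(samples):
    """Model of the LitmusExpFailureAlert rule evaluation.

    `samples` is a list of (labels, value) pairs standing in for the
    instant vector selected by the rule's expr; a message is rendered
    (from the rule's annotation template) for every series whose
    chaosresult_verdict is "Fail" and whose value is greater than 0.
    """
    messages = []
    for labels, value in samples:
        if labels.get("chaosresult_verdict") == "Fail" and value > 0:
            messages.append(
                "Chaos test {chaosengine_name} has failed in namespace "
                "{chaosresult_namespace} with probe success percentage "
                "{probe_success_percentage}".format(**labels)
            )
    return messages
```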
Configure Alertmanager to send these alerts to Slack:

global:
  resolve_timeout: 1m
receivers:
- name: slack-notifications
  slack_configs:
    - api_url: <redacted>
      channel: '#litmus-alerts'
      icon_url: https://raw.githubusercontent.com/litmuschaos/icons/master/litmus.png
      title: 'LitmusChaos Monitoring Event Notification'
      text: >-
        {{ range .Alerts }}
          *Description:* {{ .Annotations.message }} 
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
        {{ end }}
route:
  receiver: slack-notifications
  group_by: ['alertname']
  group_wait: 15s
  group_interval: 30s
  routes:
  - receiver: 'slack-notifications'
    match: 
      severity: slack
templates:
- /etc/alertmanager/config/*.tmpl
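
With group_by: ['alertname'], Alertmanager batches alerts that share an alertname into a single Slack notification (waiting group_wait before the first send, group_interval between updates). The grouping step can be sketched as follows (hypothetical Python; the real batching and timers live inside Alertmanager):

```python
from collections import defaultdict


def group_alerts(alerts):
    """Group firing alerts by their alertname label, as group_by:
    ['alertname'] does; each group becomes one notification, and the
    Slack `text` template iterates over the alerts in that group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["labels"].get("alertname", "none")].append(alert)
    return dict(groups)
```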


ksatchit commented 3 years ago

Visualize Application Metrics Interleaved With Chaos Metrics On Grafana

The following dashboard JSON plots the probe success percentage and the access latency reported by the blackbox exporter, with the Litmus chaos period overlaid as an annotation:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      },
      {
        "datasource": "Prometheus",
        "enable": true,
        "expr": "litmuschaos_awaited_experiments{chaosresult_namespace=\"litmus\",endpoint=\"tcp\",job=\"chaos-monitor\",namespace=\"litmus\",service=\"chaos-monitor\"}",
        "hide": false,
        "iconColor": "#C4162A",
        "name": "Show Chaos Period",
        "showIn": 0,
        "step": "5s"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 25,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": null,
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 9,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 2,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.5",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "exemplar": true,
          "expr": "avg_over_time(probe_success{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*100",
          "interval": "",
          "legendFormat": "Probe Success percentage",
          "refId": "A"
        }
      ],
      "thresholds": [
        {
          "$$hashKey": "object:172",
          "colorMode": "critical",
          "fill": true,
          "line": true,
          "op": "lt",
          "value": 95,
          "yaxis": "left"
        }
      ],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Probe Success Percentage",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "$$hashKey": "object:147",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": "100",
          "min": "0",
          "show": true
        },
        {
          "$$hashKey": "object:148",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": "1",
          "min": "0",
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": null,
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 4,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 2,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.5",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "exemplar": true,
          "expr": "avg_over_time(probe_duration_seconds{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*1000",
          "interval": "",
          "legendFormat": "Service Access Latency",
          "refId": "A"
        }
      ],
      "thresholds": [
        {
          "$$hashKey": "object:262",
          "colorMode": "critical",
          "fill": true,
          "line": true,
          "op": "gt",
          "value": 20,
          "yaxis": "left"
        }
      ],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Access Duration (in ms)",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "$$hashKey": "object:218",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "$$hashKey": "object:219",
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 27,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "PodtatoHead-BlackBox-Exporter",
  "uid": "V8yDu66Gk",
  "version": 2
}
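
The first panel's query, avg_over_time(probe_success[60s:1s]) * 100, is the mean of the last 60 one-second probe_success samples (each 0 or 1) expressed as a percentage, with a critical threshold when it drops below 95. The arithmetic amounts to this small sketch (hypothetical Python):

```python
def probe_success_percentage(samples, window=60):
    """Mean of the most recent `window` probe_success samples (0 or 1),
    expressed as a percent.

    Mirrors avg_over_time(probe_success[60s:1s]) * 100 for a
    1s-resolution subquery: each failed probe during chaos pulls the
    60s average below 100, tripping the <95 threshold if enough fail.
    """
    recent = samples[-window:]
    return 100.0 * sum(recent) / len(recent)
```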
ksatchit commented 3 years ago

Prepare a Chaos Scenario

Install the pod-delete chaos experiment:

kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.13.5?file=charts/generic/pod-delete/experiment.yaml -n litmus

Create a ChaosEngine that runs the pod-delete experiment against the hello service, with a continuous HTTP probe:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-chaos
  namespace: litmus
spec:
  appinfo:
    appns: 'demospace'
    applabel: 'app=helloservice'
    appkind: 'deployment'
  annotationCheck: 'false'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  monitoring: false
  jobCleanUpPolicy: 'retain'
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: http-probe
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://104.154.133.35:31798"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"
            runProperties:
              probeTimeout: 1
              interval: 1
              retry: 1
              probePollingInterval: 1
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '10'

            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '10'

            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
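
In "Continuous" mode, the http-probe keeps hitting the URL at the configured interval for the full chaos duration, and the resulting probe success percentage feeds into the experiment verdict. A simplified model of that loop (hypothetical Python; `check` stands in for the real HTTP GET with its responseCode == 200 criterion):

```python
def run_continuous_probe(check, total_duration, interval=1, retry=1):
    """Rough model of the litmus httpProbe in "Continuous" mode.

    `check(t)` models the probe criterion (response code == 200) at
    second `t`; up to `retry` extra attempts are made before a failure
    is counted, mirroring the runProperties above. Returns the probe
    success percentage.
    """
    successes = 0
    attempts = 0
    for t in range(0, total_duration, interval):
        attempts += 1
        if any(check(t) for _ in range(1 + retry)):
            successes += 1
    return 100.0 * successes / attempts
```

For example, if the service is unreachable for the last 2 seconds of a 10-second chaos window, the probe success percentage drops to 80.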
ksatchit commented 3 years ago

Trigger, Visualize & Receive Notifications on Chaos

kubectl get pods -n litmus 

NAME                                 READY   STATUS    RESTARTS   AGE
chaos-monitor-758c6b7f7c-vwxhw       1/1     Running   0          25h
chaos-operator-ce-5ffd8d8c8b-6hx7j   1/1     Running   0          2d23h
hello-chaos-runner                   1/1     Running   0          8s
pod-delete-n2e1yq-g2q9q              1/1     Running   0          6s
kubectl get pods -n demospace

NAME                            READY   STATUS        RESTARTS   AGE
helloservice-79869dd7f6-jbmn8   0/1     Terminating   0          20m
helloservice-79869dd7f6-z7ctn   1/1     Running       0          2s


ksatchit commented 3 years ago

Tips / Gotchas During Setup

There are a few things to note when performing the setup of the observability stack described in this exercise.

You can tune these values (for example, the exporter and ServiceMonitor scrape intervals, the probe timings, and the chaos duration) in a way that makes sense for your overall environment and other services, while ensuring the chosen combination works within the principles explained.

ksatchit commented 3 years ago

There are other approaches to setting up alerts - for example, from Grafana. These will be added to this thread soon!

chirangaalwis commented 3 years ago

Application Setup

Let us launch a simple application to test out the chaos observability stack being setup. In this case, I've used CNCF's podtato-head hello service for ease-of-use and illustration purposes.

  • Apply manifest to run the hello-server deployment
kubectl apply -f https://raw.githubusercontent.com/cncf/podtato-head/main/delivery/manifest/manifest.yaml
  • Obtain the LoadBalancer IP from the service (change to other service types like NodePort or do a port-forward as appropriate) and view the hello service app on the browser

image

@ksatchit FYI: since podtato-head is used as the sample in this guide (within the issue), note that the sample Kubernetes manifest is not available on the main branch.

You may use the following GitHub release tag from the repository to install the sample application.

kubectl apply -f https://raw.githubusercontent.com/cncf/podtato-head/release-0.1.0/delivery/manifest/manifest.yaml
mbecca commented 2 years ago

There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!

Hi, do you have any example for it?

ishangupta-ds commented 2 years ago

There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!

Hi, do you have any example for it?

The sock shop Grafana dashboard has sample alerts set up, which can be used as a reference.

https://github.com/litmuschaos/litmus/blob/master/monitoring/grafana-dashboards/sock-shop/README.md

https://docs.litmuschaos.io/docs/integrations/grafana#fault-injection-and-system-failure-alerts