ksatchit opened this issue 3 years ago
You can either choose to directly install the latest chaos-operator (1.13.5) in the desired cluster, OR set up the litmus portal control plane, with the operator getting installed as part of the agent registration process (2.0.0-beta7).
Case-1: Chaos-Operator Setup
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.5.yaml
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml
Or, to install the operator via the Helm chart:
kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace=litmus
Case-2: Litmus 2.0.0-Beta(7) Setup
kubectl apply -f https://litmuschaos.github.io/litmus/2.0.0-Beta/litmus-2.0.0-Beta.yaml
Or, to install the portal via the Helm chart:
kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus-portal litmuschaos/litmus-2-0-0-beta --namespace litmus --devel
Verify that the litmus chaos operator (and control plane components, in case of 2.0.0-Beta) are up and running.
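A quick way to check is to list the pods in the litmus namespace (standard kubectl; adjust the namespace if you installed elsewhere):
kubectl get pods -n litmus
Once these are healthy, install the kube-prometheus-stack in a dedicated monitoring namespace: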
kubectl create ns monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom prometheus-community/kube-prometheus-stack --namespace monitoring
Verify that the prometheus operator, the prometheus statefulset instance, the alertmanager statefulset instance, and the grafana deployment are installed and running (you will see the node-exporter daemonset and the kube-state-metrics deployment as well).
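As before, a simple pod listing in the monitoring namespace confirms this:
kubectl get pods -n monitoring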
Either edit the services of prometheus, alertmanager and grafana instances to use NodePort/LoadBalancer OR perform a port-forward operation to see the respective dashboards.
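For example, port-forwards along these lines expose the three UIs locally (a sketch; the service names assume the prom Helm release used above, so verify them with kubectl get svc -n monitoring):
kubectl port-forward svc/prom-kube-prometheus-stack-prometheus 9090:9090 -n monitoring
kubectl port-forward svc/prom-kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
kubectl port-forward svc/prom-grafana 3000:80 -n monitoring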
Install the Litmus chaos-exporter. (In case 2.0.0-Beta was set up in the earlier step, the chaos-exporter is installed automatically as part of the agent registration process; in that case, edit that deployment to add the TSDB_SCRAPE_INTERVAL environment variable shown in the manifest below.)
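If you are on the 2.0.0-Beta path and only need to add the variable to the already-running deployment, kubectl set env is one option (a sketch; it assumes the deployment is named chaos-exporter and runs in the litmus namespace):
kubectl set env deployment/chaos-exporter TSDB_SCRAPE_INTERVAL=30 -n litmus
For the 1.13.5 path, apply the full manifest: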
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaos-exporter
  template:
    metadata:
      labels:
        app: chaos-exporter
    spec:
      serviceAccountName: litmus
      containers:
      - image: litmuschaos/chaos-exporter:1.13.5
        imagePullPolicy: Always
        name: chaos-exporter
        env:
        - name: TSDB_SCRAPE_INTERVAL
          value: "30"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  ports:
  - port: 8080
    name: tcp
    protocol: TCP
    targetPort: 8080
  selector:
    app: chaos-exporter
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: chaos-exporter
    name: chaos-exporter
  name: chaos-exporter
  namespace: litmus
spec:
  endpoints:
  - interval: 1s
    port: tcp
  jobLabel: name
  namespaceSelector:
    matchNames:
    - litmus
  selector:
    matchLabels:
      app: chaos-exporter
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
data:
  blackbox.yaml: |
    modules:
      http_2xx:
        http:
          no_follow_redirects: false
          preferred_ip_protocol: ip4
          valid_http_versions:
          - HTTP/1.1
          - HTTP/2
          valid_status_codes: []
        prober: http
        timeout: 5s
---
kind: Service
apiVersion: v1
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 9115
    protocol: TCP
  selector:
    app: prometheus-blackbox-exporter
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    app: prometheus-blackbox-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter
  template:
    metadata:
      labels:
        app: prometheus-blackbox-exporter
    spec:
      restartPolicy: Always
      containers:
      - name: blackbox-exporter
        image: "prom/blackbox-exporter:v0.15.1"
        imagePullPolicy: IfNotPresent
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        args:
        - "--config.file=/config/blackbox.yaml"
        resources: {}
        ports:
        - containerPort: 9115
          name: http
        livenessProbe:
          httpGet:
            path: /health
            port: http
        readinessProbe:
          httpGet:
            path: /health
            port: http
        volumeMounts:
        - mountPath: /config
          name: config
      - name: configmap-reload
        image: "jimmidyson/configmap-reload:v0.2.2"
        imagePullPolicy: "IfNotPresent"
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9115/-/reload
        resources: {}
        volumeMounts:
        - mountPath: /etc/config
          name: config
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: prometheus-blackbox-exporter
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    name: prometheus-blackbox-exporter
    k8s-app: prometheus-blackbox-exporter
  name: prometheus-blackbox-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 1s
    path: /probe
    port: http
    params:
      module:
      - http_2xx
      target:
      - "helloservice.demospace.svc.cluster.local:9000"
    metricRelabelings:
    - action: replace
      regex: (.*)
      replacement: my_local_service
      sourceLabels:
      - __param_target
      targetLabel: target
  selector:
    matchLabels:
      app: prometheus-blackbox-exporter
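Apply the exporter manifests above; a sketch, with illustrative filenames standing in for wherever you saved the YAML:
kubectl apply -f chaos-exporter.yaml
kubectl apply -f blackbox-exporter.yaml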
Perform a kubectl apply of the following YAML to update the default Prometheus CR created on the cluster at the time of the kube-prometheus-stack install, so that it also selects the chaos-exporter and blackbox-exporter servicemonitors. (Note that the CR spec below carries another flag, evaluationInterval: 10s, apart from the servicemonitor addition; its utility will be explained in later steps.)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-prometheus
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: prom-kube-prometheus-stack-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: web
  enableAdminAPI: false
  evaluationInterval: 10s
  externalUrl: http://prom-kube-prometheus-stack-prometheus.monitoring:9090
  image: quay.io/prometheus/prometheus:v2.27.1
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prom
  portName: web
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: prom
  replicas: 1
  retention: 10d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      app: kube-prometheus-stack
      release: prom
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prom-kube-prometheus-stack-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchExpressions:
    - key: k8s-app
      operator: In
      values:
      - chaos-exporter
      - prometheus-blackbox-exporter
  shards: 1
  version: v2.27.1
At this point only the litmuschaos_cluster* metrics appear (with values set to 0), since there are no chaosengines/chaosresults in the system; newer metrics are created once experiment runs occur. Next, create a PrometheusRule so that an alert (LitmusExpFailureAlert) is fired when an experiment fails:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: prom
    meta.helm.sh/release-namespace: monitoring
    prometheus-operator-validated: "true"
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: prom
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 16.5.0
    chart: kube-prometheus-stack-16.5.0
    heritage: Helm
    release: prom
  name: prom-kube-prometheus-stack-alertmanager.rules
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: LitmusExpFailureAlert
      annotations:
        message: |
          Chaos test {{ $labels.chaosengine_name }} has failed in namespace {{ $labels.chaosresult_namespace }} with probe success percentage {{ $labels.probe_success_percentage }}
      expr: litmuschaos_experiment_verdict{chaosresult_verdict="Fail",endpoint="tcp",job="chaos-monitor",service="chaos-monitor"} > 0
      labels:
        severity: critical
Set up incoming webhooks for your Slack workspace to integrate with the Alertmanager. For this, you need the Slack API URL.
Go to Slack -> Administration -> Manage apps.
In the Manage apps directory, search for Incoming WebHooks and add it to your Slack workspace.
Create the alert-manager configuration specification; in the <redacted> field below, place your Slack API URL:
global:
  resolve_timeout: 1m
receivers:
- name: slack-notifications
  slack_configs:
  - api_url: <redacted>
    channel: '#litmus-alerts'
    icon_url: https://raw.githubusercontent.com/litmuschaos/icons/master/litmus.png
    title: 'LitmusChaos Monitoring Event Notification'
    text: >-
      {{ range .Alerts }}
        *Description:* {{ .Annotations.message }}
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
route:
  receiver: slack-notifications
  group_by: ['alertname']
  group_wait: 15s
  group_interval: 30s
  routes:
  - receiver: 'slack-notifications'
    match:
      severity: slack
templates:
- /etc/alertmanager/config/*.tmpl
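Optionally, validate the configuration before injecting it. A quick sketch, assuming the above is saved as alert-configuration.yaml (the filename used in the next step) and that the amtool binary is available locally:
amtool check-config alert-configuration.yaml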
Update this new configuration in the alert-manager by injecting it into the alert-manager statefulset instance.
First, encode the configuration into base64:
cat alert-configuration.yaml | base64 -w0
Then edit the secret to replace the existing content of the alertmanager.yaml key under .data with the string obtained in the previous step:
kubectl edit secret alertmanager-prom-kube-prometheus-stack-alertmanager -n monitoring
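As an alternative to hand-editing the base64 blob, the secret can be regenerated from the file directly; a sketch, assuming the secret key is alertmanager.yaml and the config file is alert-configuration.yaml:
kubectl -n monitoring create secret generic alertmanager-prom-kube-prometheus-stack-alertmanager --from-file=alertmanager.yaml=alert-configuration.yaml --dry-run=client -o yaml | kubectl apply -f -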
Verify that the appropriate configuration is reflected in the alert-manager console (this may take a few minutes).
Login to the Grafana dashboard (the default credentials are admin/prom-operator; edit the secret prom-grafana to set a desired username/password).
Add the Prometheus data source in Grafana (an entry for this data source may already exist from the kube-prometheus-stack install):
Create a sample dashboard to track the availability and access latency of the helloservice. You can use Prometheus' avg_over_time function to sample these values over, say, a 1m interval.
avg_over_time(probe_success{job="prometheus-blackbox-exporter", namespace="monitoring"}[60s:1s])*100
avg_over_time(probe_duration_seconds{job="prometheus-blackbox-exporter", namespace="monitoring"}[60s:1s])*1000
Under normal circumstances, these evaluate to 100 (probe success percentage, indicating the availability of the service) and roughly 15-18ms (the approximate access latency of the helloservice).
Add an annotation based on the litmuschaos_awaited_experiments metric to view the behaviour/deviation of these metrics under chaotic conditions, resulting in a "chaos-interleaved" app dashboard. The dashboard JSON used here is:
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
},
{
"datasource": "Prometheus",
"enable": true,
"expr": "litmuschaos_awaited_experiments{chaosresult_namespace=\"litmus\",endpoint=\"tcp\",job=\"chaos-monitor\",namespace=\"litmus\",service=\"chaos-monitor\"}",
"hide": false,
"iconColor": "#C4162A",
"name": "Show Chaos Period",
"showIn": 0,
"step": "5s"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 25,
"links": [],
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": null,
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 9,
"w": 12,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "avg_over_time(probe_success{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*100",
"interval": "",
"legendFormat": "Probe Success percentage",
"refId": "A"
}
],
"thresholds": [
{
"$$hashKey": "object:172",
"colorMode": "critical",
"fill": true,
"line": true,
"op": "lt",
"value": 95,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Probe Success Percentage",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"$$hashKey": "object:147",
"format": "short",
"label": null,
"logBase": 1,
"max": "100",
"min": "0",
"show": true
},
{
"$$hashKey": "object:148",
"format": "short",
"label": null,
"logBase": 1,
"max": "1",
"min": "0",
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": null,
"fieldConfig": {
"defaults": {},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"hiddenSeries": false,
"id": 4,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 2,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.5.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "avg_over_time(probe_duration_seconds{job=\"prometheus-blackbox-exporter\", namespace=\"monitoring\"}[60s:1s])*1000",
"interval": "",
"legendFormat": "Service Access Latency",
"refId": "A"
}
],
"thresholds": [
{
"$$hashKey": "object:262",
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt",
"value": 20,
"yaxis": "left"
}
],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Access Duration (in ms)",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"$$hashKey": "object:218",
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"$$hashKey": "object:219",
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"schemaVersion": 27,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "PodtatoHead-BlackBox-Exporter",
"uid": "V8yDu66Gk",
"version": 2
}
Now let us trigger a chaos experiment, say pod-delete, against the podtato-head helloservice to view the impact of chaos on the Grafana dashboard. We will also add some steady-state hypothesis constraints around the availability of this service to the experiment run using Litmus probes.
You may have noted by now that the helloservice has a single replica; deleting it causes a temporary loss of access to the service, which makes the probe fail and hence the experiment fail. The idea is to deliberately cause a failure of the experiment and visualize the Slack alert/notification based on the rule we created earlier in this exercise.
To do this: (a) install the pod-delete ChaosExperiment CR in the litmus namespace, and (b) prepare a ChaosEngine with the probe specifications.
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.13.5?file=charts/generic/pod-delete/experiment.yaml -n litmus
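To confirm the experiment CR is installed, a standard listing works (no assumptions beyond the litmus namespace used above):
kubectl get chaosexperiments -n litmus
Next, the ChaosEngine with the probe specification: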
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-chaos
  namespace: litmus
spec:
  appinfo:
    appns: 'demospace'
    applabel: 'app=helloservice'
    appkind: 'deployment'
  annotationCheck: 'false'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  monitoring: false
  jobCleanUpPolicy: 'retain'
  experiments:
  - name: pod-delete
    spec:
      probe:
      - name: http-probe
        type: "httpProbe"
        httpProbe/inputs:
          url: "http://104.154.133.35:31798"
          insecureSkipVerify: false
          method:
            get:
              criteria: "=="
              responseCode: "200"
        mode: "Continuous"
        runProperties:
          probeTimeout: 1
          interval: 1
          retry: 1
          probePollingInterval: 1
      components:
        env:
        # set chaos duration (in sec) as desired
        - name: TOTAL_CHAOS_DURATION
          value: '10'
        # set chaos interval (in sec) as desired
        - name: CHAOS_INTERVAL
          value: '10'
        # pod failures without '--force' & default terminationGracePeriodSeconds
        - name: FORCE
          value: 'false'
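Apply the engine to start the run. A minimal sketch, assuming the ChaosEngine above was saved locally as chaosengine.yaml (the filename is illustrative):
kubectl apply -f chaosengine.yaml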
Applying this ChaosEngine results in the creation of the chaos pods in the litmus namespace and the termination of the helloservice pod in the demospace namespace.
kubectl get pods -n litmus
NAME READY STATUS RESTARTS AGE
chaos-monitor-758c6b7f7c-vwxhw 1/1 Running 0 25h
chaos-operator-ce-5ffd8d8c8b-6hx7j 1/1 Running 0 2d23h
hello-chaos-runner 1/1 Running 0 8s
pod-delete-n2e1yq-g2q9q 1/1 Running 0 6s
kubectl get pods -n demospace
NAME READY STATUS RESTARTS AGE
helloservice-79869dd7f6-jbmn8 0/1 Terminating 0 20m
helloservice-79869dd7f6-z7ctn 1/1 Running 0 2s
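Once the run completes, the experiment verdict is recorded on the chaosresult CR. By Litmus convention the resource is named <chaosengine>-<experiment>, so here it should be hello-chaos-pod-delete; treat that name as an assumption and list the chaosresults first if unsure:
kubectl get chaosresults -n litmus
kubectl describe chaosresult hello-chaos-pod-delete -n litmus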
There are a few things to take care of/note when performing the setup of the observability stack described in this exercise.
Ensure that the labels and port-names referred to in servicemonitors match with those in the kubernetes service resources for a given component.
The correct combination of the following properties is critical for successful reception of the Slack alerts/notifications:
- TSDB_SCRAPE_INTERVAL in the chaos-exporter deployment
- spec.endpoints.interval defined for the chaos-exporter's servicemonitor (essentially the scrape interval for the service)
- the for directive in the PrometheusRule for the litmus alert (best kept at 0s OR skipped altogether, as in our example)
- evaluationInterval in the prometheus CR
- the group_wait period within the alertmanager configuration (alert-configuration.yaml)
The explanation/reason for this is provided below:
The litmuschaos_experiment_verdict metric, which provides info on a given experiment's pass/fail status, is transient in nature and lasts only for the TSDB_SCRAPE_INTERVAL period (if this env is not specified, it assumes the value of the chaos-exporter's scrape interval by prometheus). In our case this is 30s, i.e., litmuschaos_experiment_verdict{chaosresult_verdict="Fail"} will stay set to 1 for 30s once the experiment fails.
The reason for this metric's transience is that the chaosresult CR contributing to it will/can enter other verdict states (awaited/stopped/pass/fail, etc.) if the experiment is re-executed from the same chaosengine. In other words, it is not cumulative. Since the metric lives for a specific period, the evaluationInterval of prometheus (the polling period for checking whether an alert expression is satisfied) should be kept lower than the TSDB_SCRAPE_INTERVAL; in our case it is set to 10s.
Once Prometheus evaluates the alert rule to be true, it looks at the for directive within the PrometheusRule/alert rule to see how long the expr/condition must remain true before the alert actually fires. While this holds good for other system conditions (say, CPU utilization or load average), in our case a failure alert must be fired immediately. Therefore it is better to keep the directive at for: 0s or skip it altogether.
Once all these conditions are met, the alert is fired from Prometheus and passed to the alert-manager. The alert-manager then waits for the group_wait period, checking whether any more alerts arrive within the defined group (in our case we group by alertname), before collating and sending the Slack notification. We want this period to be low, and the same (instance of the) alert to stay alive during the wait, so we have selected a value (15s) lower than the TSDB_SCRAPE_INTERVAL.
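Because litmuschaos_experiment_verdict is transient, it can help to confirm by hand that the alert expression currently returns a series. A minimal check against the Prometheus HTTP API, assuming Prometheus is reachable on localhost:9090 (e.g., via the port-forward shown earlier):
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=litmuschaos_experiment_verdict{chaosresult_verdict="Fail"} > 0'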
You can play around with these values in a way that makes sense for your overall environment/other services, while ensuring the combination you select works within the principles explained above.
There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!
Introduction
This issue contains the steps for setting up the scrape of litmus metrics by prometheus and instrumenting application dashboards on grafana with these metrics (as annotations). We will also cover the steps needed to receive slack notifications/alerts based on litmus metrics (esp. for chaos experiment failures).
Application Setup
Let us launch a simple application to test out the chaos observability stack being setup. In this case, I've used CNCF's podtato-head hello service for ease-of-use and illustration purposes.
- Apply manifest to run the hello-server deployment
kubectl apply -f https://raw.githubusercontent.com/cncf/podtato-head/main/delivery/manifest/manifest.yaml
- Obtain the LoadBalancer IP from the service (change to other service types like NodePort or do a port-forward as appropriate) and view the hello service app on the browser
@ksatchit
FYI, since podtato-head is used as the sample in this guide (within the issue), it should be noted that the sample Kubernetes manifest is not available in the main branch.
You may use the following GitHub release tag from the repository, to install the sample application.
kubectl apply -f https://raw.githubusercontent.com/cncf/podtato-head/release-0.1.0/delivery/manifest/manifest.yaml
There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!
Hi, do you have any example for it?
There are other approaches to setup alerts - for example, from Grafana. These will be added in this thread soon!
Hi, do you have any example for it?
The sock shop Grafana dashboard has sample alerts set up which can be used as a reference:
https://github.com/litmuschaos/litmus/blob/master/monitoring/grafana-dashboards/sock-shop/README.md
https://docs.litmuschaos.io/docs/integrations/grafana#fault-injection-and-system-failure-alerts