linki / chaoskube

chaoskube periodically kills random pods in your Kubernetes cluster.
MIT License

chaoskube

Why

Test how your system behaves under arbitrary pod failures.

Example

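By default, running it selects a random pod in any namespace every 10 minutes. Dry-run mode is enabled by default, so pods are only logged as terminated until you disable it.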

$ chaoskube
INFO[0000] starting up              dryRun=true interval=10m0s version=v0.21.0
INFO[0000] connecting to cluster    master="https://kube.you.me" serverVersion=v1.10.5+coreos.0
INFO[0000] setting pod filter       annotations= labels= minimumAge=0s namespaces=
INFO[0000] setting quiet times      daysOfYear="[]" timesOfDay="[]" weekdays="[]"
INFO[0000] setting timezone         location=UTC name=UTC offset=0
INFO[0001] terminating pod          name=kube-dns-v20-6ikos namespace=kube-system
INFO[0601] terminating pod          name=nginx-701339712-u4fr3 namespace=chaoskube
INFO[1201] terminating pod          name=kube-proxy-gke-earthcoin-pool-3-5ee87f80-n72s namespace=kube-system
INFO[1802] terminating pod          name=nginx-701339712-bfh2y namespace=chaoskube
INFO[2402] terminating pod          name=heapster-v1.2.0-1107848163-bhtcw namespace=kube-system
INFO[3003] terminating pod          name=l7-default-backend-v1.0-o2hc9 namespace=kube-system
INFO[3603] terminating pod          name=heapster-v1.2.0-1107848163-jlfcd namespace=kube-system
INFO[4203] terminating pod          name=nginx-701339712-bfh2y namespace=chaoskube
INFO[4804] terminating pod          name=nginx-701339712-51nt8 namespace=chaoskube
...

chaoskube lets you filter target pods by namespace, label, annotation, and age, and exclude certain weekdays, times of day, and days of the year from chaos.

How

Helm

You can install chaoskube with Helm. Follow Helm's Quickstart Guide and then install the chaoskube chart.

$ helm repo add chaoskube https://linki.github.io/chaoskube/
$ helm install chaoskube chaoskube/chaoskube --atomic --namespace=chaoskube --create-namespace

Refer to chaoskube on kubeapps.com to learn how to configure it and to find other useful Helm charts.

Raw manifest

Refer to the example manifest. Be sure to give chaoskube the appropriate permissions using the provided ClusterRole.

Configuration

By default, chaoskube is friendly and does not kill anything. Once you have validated your target cluster, you can disable dry-run mode by passing the --no-dry-run flag. You can also specify a more aggressive interval and other supported flags for your deployment.

If you're running in a Kubernetes cluster and want to target the same cluster then this is all you need to do.

If you want to target a different cluster or run chaoskube locally, specify your cluster via the --master flag or provide a valid kubeconfig via the --kubeconfig flag. By default, it uses the standard kubeconfig path in your home directory, so whatever context is current there will be targeted.

If you want to increase or decrease the amount of chaos, change the interval between killings with the --interval flag. Alternatively, you can increase the number of replicas of your chaoskube deployment.
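To get a feel for how the interval and replica count translate into chaos, here is a rough back-of-the-envelope sketch. The function name is made up for illustration, and it assumes that each replica acts independently, that every run finds enough candidates, and that no quiet times are configured, so the result is an upper bound rather than a prediction.

```python
from datetime import timedelta


def max_terminations_per_day(interval: timedelta, replicas: int = 1, max_kill: int = 1) -> int:
    """Upper bound on pod terminations per day (hypothetical helper).

    Assumes every replica terminates `max_kill` pods on every run.
    """
    runs_per_day = timedelta(days=1) // interval  # whole runs that fit into 24 hours
    return runs_per_day * replicas * max_kill


# Default settings: one replica, --interval 10m, --max-kill 1.
print(max_terminations_per_day(timedelta(minutes=10)))

# A more aggressive setup: two replicas with --interval 5m.
print(max_terminations_per_day(timedelta(minutes=5), replicas=2))
```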

Remember that chaoskube by default kills any pod in all your namespaces, including system pods and itself.

chaoskube provides a simple HTTP endpoint that can be used to check that it is running. This can be used for Kubernetes liveness and readiness probes. By default, this listens on port 8080. To disable, pass --metrics-address="" to chaoskube.

Filtering targets

You can limit the search space of chaoskube by providing label, annotation, and namespace selectors, pod name include/exclude patterns, as well as a minimum age setting.

$ chaoskube --labels 'app=mate,chaos,stage!=production'
...
INFO[0000] setting pod filter       labels="app=mate,chaos,stage!=production"

This selects all pods that have the label app set to mate, the label chaos set to anything and the label stage not set to production or unset.
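The selector semantics described above can be modelled in a few lines. This is only a simplified sketch of equality-based selectors (`key=value`, `key!=value`, `key`, `!key`), not chaoskube's actual code; the function name `matches` is hypothetical, and set-based expressions such as `in (...)` are deliberately not handled.

```python
def matches(selector: str, labels: dict) -> bool:
    """Return True if `labels` satisfies every comma-separated requirement.

    Simplified model: supports key=value, key==value, key!=value, key, !key.
    """
    for term in (t.strip() for t in selector.split(",")):
        if not term:
            continue  # an empty selector matches everything
        if "!=" in term:
            key, value = term.split("!=", 1)
            # Satisfied when the label is unset or has a different value.
            if labels.get(key.strip()) == value.strip():
                return False
        elif "=" in term:
            key, value = term.replace("==", "=").split("=", 1)
            if labels.get(key.strip()) != value.strip():
                return False
        elif term.startswith("!"):
            if term[1:].strip() in labels:  # label must be absent
                return False
        elif term not in labels:  # label must be present, any value
            return False
    return True


selector = "app=mate,chaos,stage!=production"
print(matches(selector, {"app": "mate", "chaos": "yes"}))
print(matches(selector, {"app": "mate", "chaos": "yes", "stage": "production"}))
```

The first call matches because stage is unset; the second does not because stage is set to production.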

You can filter target pods by namespace selector as well.

$ chaoskube --namespaces 'default,testing,staging'
...
INFO[0000] setting pod filter       namespaces="default,staging,testing"

This will filter for pods in the three namespaces default, staging and testing.

Namespaces can additionally be filtered by a namespace label selector.

$ chaoskube --namespace-labels='!integration'
...
INFO[0000] setting pod filter       namespaceLabels="!integration"

This will exclude all pods from namespaces with the label integration.

You can filter target pods by OwnerReference's kind selector.

$ chaoskube --kinds '!DaemonSet,!StatefulSet'
...
INFO[0000] setting pod filter       kinds="!DaemonSet,!StatefulSet"

This will exclude any DaemonSet and StatefulSet pods.

$ chaoskube --kinds 'DaemonSet'
...
INFO[0000] setting pod filter       kinds="DaemonSet"

This will include only DaemonSet pods.

Please note: any include filter will automatically exclude all the pods with no OwnerReference defined.
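The interplay between include and exclude entries, and the special case for pods without an OwnerReference, can be sketched as follows. The function `kind_allowed` is a hypothetical illustration of the behaviour described above, not chaoskube's implementation.

```python
def kind_allowed(selector: str, owner_kinds: list) -> bool:
    """Decide whether a pod passes a --kinds style selector.

    `owner_kinds` holds the kinds from the pod's OwnerReferences
    (empty when the pod has no owner).
    """
    terms = [t.strip() for t in selector.split(",") if t.strip()]
    excluded = {t[1:] for t in terms if t.startswith("!")}
    included = {t for t in terms if not t.startswith("!")}
    kinds = set(owner_kinds)

    if kinds & excluded:
        return False  # owned by an excluded kind
    if included and not (kinds & included):
        return False  # an include filter rejects pods without a matching owner
    return True


print(kind_allowed("!DaemonSet,!StatefulSet", ["ReplicaSet"]))  # passes the exclude filter
print(kind_allowed("DaemonSet", []))                            # no owner, include filter set
```

Note that a pod with no owner still passes an exclude-only selector, but never passes a selector that contains an include entry.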

You can filter pods by name:

$ chaoskube --included-pod-names 'foo|bar' --excluded-pod-names 'prod'
...
INFO[0000] setting pod filter       excludedPodNames=prod includedPodNames="foo|bar"

This will cause only pods whose name contains 'foo' or 'bar' and does not contain 'prod' to be targeted.
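As a rough illustration of how the two patterns combine, the sketch below uses Python's `re` module. chaoskube itself is written in Go, whose regular expression syntax differs from Python's for advanced patterns, so treat this as a model of the include/exclude logic only. The function name is hypothetical.

```python
import re


def name_targeted(name: str, included: str = "", excluded: str = "") -> bool:
    """Return True if a pod name passes both name patterns.

    An empty pattern means "no restriction", mirroring the defaults
    (all included, none excluded).
    """
    if included and not re.search(included, name):
        return False  # does not match the include pattern anywhere in the name
    if excluded and re.search(excluded, name):
        return False  # matches the exclude pattern
    return True


for pod in ["foo-7d9f", "bar-prod-1", "baz-42"]:
    print(pod, name_targeted(pod, included="foo|bar", excluded="prod"))
```

Only foo-7d9f is targeted here: bar-prod-1 matches the exclude pattern and baz-42 does not match the include pattern.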

You can also exclude namespaces and mix and match with the label and annotation selectors.

$ chaoskube \
    --labels 'app=mate,chaos,stage!=production' \
    --annotations '!scheduler.alpha.kubernetes.io/critical-pod' \
    --namespaces '!kube-system,!production'
...
INFO[0000] setting pod filter       annotations="!scheduler.alpha.kubernetes.io/critical-pod" labels="app=mate,chaos,stage!=production" namespaces="!kube-system,!production"

This further limits the search space of the above label selector by also excluding any pods in the kube-system and production namespaces and ignoring all pods that are marked as critical.
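Conceptually, each selector narrows the candidate set, and a pod must pass all of them. The sketch below models that with plain dictionaries and a hypothetical `select_candidates` helper; the label filter is reduced to `app=mate` to keep the example short, and the pod data is invented for illustration.

```python
def select_candidates(pods, *filters):
    """Keep only the pods accepted by every filter (logical AND)."""
    return [pod for pod in pods if all(accept(pod) for accept in filters)]


CRITICAL = "scheduler.alpha.kubernetes.io/critical-pod"

# Invented sample data, not real cluster output.
pods = [
    {"name": "mate-1", "namespace": "default", "labels": {"app": "mate"}, "annotations": {}},
    {"name": "mate-2", "namespace": "production", "labels": {"app": "mate"}, "annotations": {}},
    {"name": "mate-3", "namespace": "default", "labels": {"app": "mate"}, "annotations": {CRITICAL: ""}},
    {"name": "dns-1", "namespace": "kube-system", "labels": {"app": "dns"}, "annotations": {}},
]

victims = select_candidates(
    pods,
    lambda p: p["labels"].get("app") == "mate",                     # simplified --labels
    lambda p: CRITICAL not in p["annotations"],                     # --annotations '!...'
    lambda p: p["namespace"] not in {"kube-system", "production"},  # --namespaces '!...'
)
print([p["name"] for p in victims])
```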

The annotation selector can also be used to run chaoskube as a cluster addon and allow pods to opt-in to being terminated as you see fit. For example, you could run chaoskube like this:

$ chaoskube --annotations 'chaos.alpha.kubernetes.io/enabled=true' --debug
...
INFO[0000] setting pod filter       annotations="chaos.alpha.kubernetes.io/enabled=true"
DEBU[0000] found candidates         count=0
DEBU[0000] no victim found

Unless you already use that annotation somewhere, this will initially ignore all of your pods (you can see the number of candidates in debug mode). You could then selectively opt-in individual deployments to chaos mode by annotating their pods with chaos.alpha.kubernetes.io/enabled=true.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        chaos.alpha.kubernetes.io/enabled: "true"
    spec:
      ...

You can exclude pods that have recently started by using the --minimum-age flag.

$ chaoskube --minimum-age 6h
...
INFO[0000] setting pod filter       minimumAge=6h0m0s
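The age check amounts to comparing a pod timestamp against the current time. In the sketch below, the function name, the choice of timestamp, and the inclusive boundary are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def old_enough(created_at: datetime, minimum_age: timedelta, now: datetime = None) -> bool:
    """Return True if the pod has existed for at least `minimum_age`.

    Assumption: age is measured from a timezone-aware pod timestamp
    such as its creation time.
    """
    now = now or datetime.now(timezone.utc)
    return now - created_at >= minimum_age


now = datetime(2024, 6, 4, 12, 0, tzinfo=timezone.utc)
print(old_enough(now - timedelta(hours=7), timedelta(hours=6), now))  # older than 6h
print(old_enough(now - timedelta(hours=1), timedelta(hours=6), now))  # too young
```

With the default of 0s every pod passes the check, which matches the "matches every pod" default.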

Limit the Chaos

You can limit when chaos is introduced by weekday, time period of the day, day of the year, or any combination of them.

Pass a comma-separated list of abbreviated weekdays via the --excluded-weekdays option, a comma-separated list of time periods via the --excluded-times-of-day option, and/or a comma-separated list of days of the year via the --excluded-days-of-year option, and specify a --timezone by which to interpret them.

$ chaoskube \
    --excluded-weekdays=Sat,Sun \
    --excluded-times-of-day=22:00-08:00,11:00-13:00 \
    --excluded-days-of-year=Apr1,Dec24 \
    --timezone=Europe/Berlin
...
INFO[0000] setting quiet times      daysOfYear="[Apr 1 Dec24]" timesOfDay="[22:00-08:00 11:00-13:00]" weekdays="[Saturday Sunday]"
INFO[0000] setting timezone         location=Europe/Berlin name=CET offset=1

Use UTC, Local or pick a timezone name from the (IANA) tz database. If you're testing chaoskube from your local machine then Local makes the most sense. Once you deploy chaoskube to your cluster you should deploy it with a specific timezone, e.g. where most of your team members are living, so that both your team and chaoskube have a common understanding when a particular weekday begins and ends, for instance. If your team is spread across multiple time zones it's probably best to pick UTC which is also the default. Picking the wrong timezone shifts the meaning of a particular weekday by a couple of hours between you and the server.
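To make the quiet-time rules concrete, the following sketch checks a timestamp against excluded weekdays, time periods, and days of the year. It is a simplified model with hypothetical function names. It assumes that a period such as 22:00-08:00 wraps around midnight and that period boundaries are inclusive. The timestamp should already be expressed in the configured timezone, for example via `now.astimezone(zoneinfo.ZoneInfo("Europe/Berlin"))`.

```python
from datetime import datetime, time, timezone

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def in_period(t: time, start: time, end: time) -> bool:
    """True if `t` lies within [start, end], wrapping around midnight if needed."""
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end  # e.g. 22:00-08:00


def is_quiet(now: datetime, weekdays=(), periods=(), days_of_year=()) -> bool:
    """True if chaos should be suspended at `now` (already in the target timezone)."""
    if WEEKDAYS[now.weekday()] in weekdays:
        return True
    if any(in_period(now.time(), start, end) for start, end in periods):
        return True
    return (now.month, now.day) in days_of_year


config = dict(
    weekdays=("Sat", "Sun"),
    periods=((time(22, 0), time(8, 0)), (time(11, 0), time(13, 0))),
    days_of_year=((4, 1), (12, 24)),
)
print(is_quiet(datetime(2024, 6, 4, 23, 30, tzinfo=timezone.utc), **config))  # overnight period
print(is_quiet(datetime(2024, 6, 4, 15, 0, tzinfo=timezone.utc), **config))   # regular afternoon
```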

Flags

| Option | Environment | Description | Default |
|--------|-------------|-------------|---------|
| --interval | CHAOSKUBE_INTERVAL | interval between pod terminations | 10m |
| --labels | CHAOSKUBE_LABELS | label selector to filter pods by | (matches everything) |
| --annotations | CHAOSKUBE_ANNOTATIONS | annotation selector to filter pods by | (matches everything) |
| --kinds | CHAOSKUBE_KINDS | owner's kind selector to filter pods by | (all kinds) |
| --namespaces | CHAOSKUBE_NAMESPACES | namespace selector to filter pods by | (all namespaces) |
| --namespace-labels | CHAOSKUBE_NAMESPACE_LABELS | label selector to filter namespaces and its pods by | (all namespaces) |
| --included-pod-names | CHAOSKUBE_INCLUDED_POD_NAMES | regular expression pattern for pod names to include | (all included) |
| --excluded-pod-names | CHAOSKUBE_EXCLUDED_POD_NAMES | regular expression pattern for pod names to exclude | (none excluded) |
| --excluded-weekdays | CHAOSKUBE_EXCLUDED_WEEKDAYS | weekdays when chaos is to be suspended, e.g. "Sat,Sun" | (no weekday excluded) |
| --excluded-times-of-day | CHAOSKUBE_EXCLUDED_TIMES_OF_DAY | times of day when chaos is to be suspended, e.g. "22:00-08:00" | (no times of day excluded) |
| --excluded-days-of-year | CHAOSKUBE_EXCLUDED_DAYS_OF_YEAR | days of a year when chaos is to be suspended, e.g. "Apr1,Dec24" | (no days of year excluded) |
| --timezone | CHAOSKUBE_TIMEZONE | timezone from tz database, e.g. "America/New_York", "UTC" or "Local" | (UTC) |
| --max-runtime | CHAOSKUBE_MAX_RUNTIME | Maximum runtime before chaoskube exits | -1s (infinite time) |
| --max-kill | CHAOSKUBE_MAX_KILL | Specifies the maximum number of pods to be terminated per interval | 1 |
| --minimum-age | CHAOSKUBE_MINIMUM_AGE | Minimum age to filter pods by | 0s (matches every pod) |
| --dry-run | CHAOSKUBE_DRY_RUN | don't kill pods, only log what would have been done | true |
| --log-format | CHAOSKUBE_LOG_FORMAT | specify the format of the log messages. Options are text and json | text |
| --log-caller | CHAOSKUBE_LOG_CALLER | include the calling function name and location in the log messages | false |
| --slack-webhook | CHAOSKUBE_SLACK_WEBHOOK | The address of the slack webhook for notifications | disabled |
| --client-namespace-scope | CHAOSKUBE_CLIENT_NAMESPACE_SCOPE | Scope Kubernetes API calls to the given namespace | (all namespaces) |
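The environment variable names listed above follow a regular pattern: the flag name in upper case, with dashes replaced by underscores and a CHAOSKUBE_ prefix. The small helper below (a hypothetical name, derived only from the listed flags) captures that pattern and may be handy when generating deployment manifests.

```python
def env_name(flag: str) -> str:
    """Derive the environment variable name for a chaoskube flag.

    Based on the naming pattern visible in the flags listed above.
    """
    return "CHAOSKUBE_" + flag.lstrip("-").replace("-", "_").upper()


print(env_name("--interval"))
print(env_name("--excluded-times-of-day"))
```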

Related work

There are several other projects that allow you to create some chaos in your Kubernetes cluster.

Acknowledgements

This project wouldn't be where it is without the ideas and help of several awesome contributors.

Contributing

Feel free to create issues or submit pull requests.