open-fresh / bomb-squad

Automatic detection and silencing of high cardinality series in Prometheus.
Apache License 2.0
139 stars 12 forks source link

Bomb Squad: Suppressing Prometheus Cardinality Explosions

CircleCI Docker Repository on Quay

Bomb Squad is a sidecar to Kubernetes-deployed Prometheus instances that detects and suppresses cardinality explosions. It is a tool intended to bring operational stability and greater visibility in times of rapid cardinality inflation, keeping your Prometheus instances online and usable while providing clear indications that something is trying to blow up.

Status

Bomb Squad is currently an alpha project, with a few caveats of which you should be aware:

Suppressing the what now?

You might find now and again that one or more of your Prometheus scrape targets begins to expose some manner of super high-cardinality data as metric labels. Prometheus is awesome at handling "typical" high-cardinality behavior:

There are events, however, in which there is dramatic, sustained growth in the cardinality of one or more metrics. We call these events "cardinality explosions", and they can reduce your Prometheus instance(s) and any downstream receiving services to a smoldering heap in very short order.

Some examples of these events include:

Bomb Squad is designed to detect these events and, by way of standard Prometheus capabilities, suppress the negative behavior so that Prometheus can stay online and downstream services can continue to function reliably.

How does it work?

Bomb Squad is deployed as a sidecar within your Kubernetes Prometheus pods. One this is done, it does the following:

Run Bomb Squad Locally

There is a handy script, run-local/run-minikube.sh that will spin up a minikube environment for you that will contain the necessary components to play with and try out Bomb Squad locally. Steps:

make clean # just in case the image needs to be rebuilt by the minikube docker engine
cd run-local
./run-minikube.sh
minikube service prometheus

This will get things spun up and open a browser window with the stock Prometheus query UI. You can check out the test metric by querying for statspitter_high_card_test_gauge_vec. It's also worth taking a look at the bootstrapped recording rules, if you're curious, by visiting the Status -> Rules page.

Before triggering a cardinality explosion, it's recommended that you tail the bomb-squad container's logs. Our preferred method is with stern:

stern . -c bomb-squad

To trigger a cardinality explosion and consequently a suppression event by Bomb Squad, run:

# StatSpitter is a toy app that spits out ~100 new series per second on request
curl -i $(minikube service statspitter --url)/toggle

If you watch the bomb-squad container logs, you should see some detection and rule insertion messages go by after a few seconds. Bomb Squad automatically reloads the Prometheus config, so you won't need to take any further action to suppress the explosion!

You can view Bomb Squad's metrics in Prometheus by querying for bomb_squad_exploding_label_distinct_values. You can also view what metric.label combinations Bomb Squad is currently silencing by using the CLI in the running container:

kubectl exec <prometheus_pod_name> -c bomb-squad -- bs list

To remediate our simulated "bad code deploy" that caused the explosion, delete the statspitter pod to stop the explosion and dump the old exploded series from its registry:

kubectl delete pod -l app=statspitter

Finally, to remove the silence on our test metric:

kubectl exec <prometheus_pod_name> -c bomb-squad -- bs unsilence <metric.label as shown by bs list above>

Deploying Bomb Squad

Bomb Squad needs to be deployed as a sidecar container inside your Prometheus pod(s), and there are a couple of requirements to note:

A container spec along the lines of the following, added to your Prometheus pod spec, should do the trick:

spec:
  ...
  template:
    ...
    spec:
    ...
      containers:
        ...
        <prometheus container spec>
        ...
        - name: bomb-squad
          image: gcr.io/freshtracks-io/bomb-squad:latest
          args:
          - -prom-url=localhost:9090 # In case you run Prometheus on a non-standard port
          ports:
          - containerPort: 8080
            protocol: TCP
          volumeMounts:
          - mountPath: /etc/config/bomb-squad
            name: bomb-squad-rules
      volumes:
        ...
        - emptyDir: {}
          name: bomb-squad-rules