Bomb Squad is a sidecar to Kubernetes-deployed Prometheus instances that detects and suppresses cardinality explosions. It is a tool intended to bring operational stability and greater visibility in times of rapid cardinality inflation, keeping your Prometheus instances online and usable while providing clear indications that something is trying to blow up.
Bomb Squad is currently an alpha project, with a few caveats of which you should be aware:
config package (doing so naively will pull in all of the service discovery vendor code, which hurts).
You might find now and again that one or more of your Prometheus scrape targets begins to expose some manner of super high-cardinality data as metric labels. Prometheus is awesome at handling "typical" high-cardinality behavior:
There are events, however, in which there is dramatic, sustained growth in the cardinality of one or more metrics. We call these events "cardinality explosions", and they can reduce your Prometheus instance(s) and any downstream receiving services to a smoldering heap in very short order.
Some examples of these events include:
Bomb Squad is designed to detect these events and, by way of standard Prometheus capabilities, suppress the negative behavior so that Prometheus can stay online and downstream services can continue to function reliably.
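To make the idea of detection concrete, here is a minimal sketch that counts distinct label values per metric.label pair and flags any pair above a threshold. It is an illustration only, not Bomb Squad's actual detection logic; the function names, the threshold, and the sample metric are all hypothetical.

```python
from collections import defaultdict


def distinct_label_values(series):
    """Count distinct values seen for each (metric, label) pair.

    `series` is an iterable of (metric_name, labels_dict) tuples.
    """
    seen = defaultdict(set)
    for metric, labels in series:
        for label, value in labels.items():
            seen[(metric, label)].add(value)
    return {key: len(values) for key, values in seen.items()}


def exploding_labels(series, threshold):
    """Return metric.label names whose distinct-value count exceeds `threshold`."""
    counts = distinct_label_values(series)
    return sorted(f"{m}.{l}" for (m, l), n in counts.items() if n > threshold)


# Hypothetical sample: `method` stays constant, `request_id` is unique per series.
sample = [
    ("http_requests_total", {"method": "GET", "request_id": f"req-{i}"})
    for i in range(500)
]
print(exploding_labels(sample, threshold=100))
```

A real detector would presumably also look at how quickly the count grows over time, since the text describes explosions as dramatic, sustained growth rather than a merely large static count.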
Bomb Squad is deployed as a sidecar within your Kubernetes Prometheus pods. Once this is done, it does the following:
metric.labelName in Bomb Squad ConfigMap entry
There is a handy script, run-local/run-minikube.sh, that spins up a minikube environment with the components needed to try out Bomb Squad locally.
Steps:
make clean # just in case the image needs to be rebuilt by the minikube docker engine
cd run-local
./run-minikube.sh
minikube service prometheus
This will get things spun up and open a browser window with the stock Prometheus query UI. You can check out the test metric by querying for statspitter_high_card_test_gauge_vec. It's also worth taking a look at the bootstrapped recording rules, if you're curious, by visiting the Status -> Rules page.
Before triggering a cardinality explosion, it's recommended that you tail the bomb-squad container's logs. Our preferred method is with stern:
stern . -c bomb-squad
To trigger a cardinality explosion and consequently a suppression event by Bomb Squad, run:
# StatSpitter is a toy app that spits out ~100 new series per second on request
curl -i $(minikube service statspitter --url)/toggle
If you watch the bomb-squad container logs, you should see some detection and rule insertion messages go by after a few seconds. Bomb Squad automatically reloads the Prometheus config, so you won't need to take any further action to suppress the explosion!
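Why suppression keeps Prometheus healthy can be shown with a small model. The sketch below assumes the inserted rule behaves like a relabel "replace" action that overwrites the offending label with a fixed placeholder; that mechanism, the placeholder value, and the metric and label names are assumptions made for illustration, not details confirmed by this README.

```python
def apply_silence(series, metric, label, placeholder="silenced"):
    """Overwrite `label` on every series of `metric` with a fixed placeholder."""
    out = []
    for name, labels in series:
        if name == metric and label in labels:
            # Build a new dict so the input series are left untouched.
            labels = {**labels, label: placeholder}
        out.append((name, labels))
    return out


def cardinality(series):
    """Number of distinct series identities (metric name plus label set)."""
    return len({(name, tuple(sorted(labels.items()))) for name, labels in series})


# Hypothetical exploded metric: one new `id` value per series.
exploded = [("test_gauge", {"job": "statspitter", "id": str(i)}) for i in range(1000)]
silenced = apply_silence(exploded, "test_gauge", "id")
print(cardinality(exploded), cardinality(silenced))  # 1000 1
```

In this model, the thousand distinct series collapse into a single identity once the label is overwritten, which is the effect a suppression rule is meant to have on stored series.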
You can view Bomb Squad's metrics in Prometheus by querying for bomb_squad_exploding_label_distinct_values.
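The same metric can also be read programmatically through the Prometheus HTTP API's instant-query endpoint (/api/v1/query). The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the base URL is whatever your environment exposes (for the local setup, the URL printed by minikube service prometheus --url).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

METRIC = "bomb_squad_exploding_label_distinct_values"


def query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def parse_vector(payload):
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries its labels under "metric" and [timestamp, "value"] under "value".
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


def fetch(base_url, expr=METRIC):
    """Run the query against a live Prometheus (requires network access)."""
    with urlopen(query_url(base_url, expr)) as resp:
        return parse_vector(json.load(resp))
```

Calling fetch with your Prometheus base URL returns one (labels, value) pair per series; which labels appear on each series depends on Bomb Squad itself and is not assumed here.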
You can also view what metric.label combinations Bomb Squad is currently silencing by using the CLI in the running container:
kubectl exec <prometheus_pod_name> -c bomb-squad -- bs list
To remediate our simulated "bad code deploy" that caused the explosion, delete the statspitter pod to stop the explosion and dump the old exploded series from its registry:
kubectl delete pod -l app=statspitter
Finally, to remove the silence on our test metric:
kubectl exec <prometheus_pod_name> -c bomb-squad -- bs unsilence <metric.label as shown by bs list above>
Bomb Squad needs to be deployed as a sidecar container inside your Prometheus pod(s), and there are a couple of requirements to note:
emptyDir volume so that it has a place from which to bootstrap its rules
A container spec along the lines of the following, added to your Prometheus pod spec, should do the trick:
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
        ...
        <prometheus container spec>
        ...
        - name: bomb-squad
          image: gcr.io/freshtracks-io/bomb-squad:latest
          args:
          - -prom-url=localhost:9090 # In case you run Prometheus on a non-standard port
          ports:
          - containerPort: 8080
            protocol: TCP
          volumeMounts:
          - mountPath: /etc/config/bomb-squad
            name: bomb-squad-rules
      volumes:
        ...
        - emptyDir: {}
          name: bomb-squad-rules
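One easy mistake with a spec like this is mounting a volume name that the pod never declares. The sketch below is a hypothetical sanity check over a pod spec that has already been loaded into a plain dictionary (for example by a YAML parser of your choice); it is not part of Bomb Squad.

```python
def missing_volumes(pod_spec):
    """Return volumeMount names that no entry under `volumes` declares."""
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    mounted = {
        m["name"]
        for c in pod_spec.get("containers", [])
        for m in c.get("volumeMounts", [])
    }
    return sorted(mounted - declared)


# Pod-level spec mirroring the sidecar fragment above (Prometheus container omitted).
pod_spec = {
    "containers": [
        {
            "name": "bomb-squad",
            "volumeMounts": [
                {"mountPath": "/etc/config/bomb-squad", "name": "bomb-squad-rules"}
            ],
        }
    ],
    "volumes": [{"name": "bomb-squad-rules", "emptyDir": {}}],
}
print(missing_volumes(pod_spec))  # []
```

An empty result means every mount has a matching volume; any names returned point at mounts that would keep the pod from being created.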