prometheus-operator / runbooks

https://runbooks.prometheus-operator.dev
Apache License 2.0

port runbooks from openshift #5

Closed paulfantom closed 2 years ago

paulfantom commented 2 years ago

More runbooks :)

@ArthurSens could you take a look? This subproject is of more concern to kube-prometheus than to the operator itself, because alerts from kube-prometheus point here by default.

netlify[bot] commented 2 years ago

✔️ Deploy Preview for distracted-northcutt-e0bccc ready!

🔨 Explore the source changes: 89112c890c308a1bc28b7b3dc25f38df1c4f8878

🔍 Inspect the deploy log: https://app.netlify.com/sites/distracted-northcutt-e0bccc/deploys/61851645ff58c800081a0c2a

😎 Browse the preview: https://deploy-preview-5--distracted-northcutt-e0bccc.netlify.app

ArthurSens commented 2 years ago

I've been looking at this PR from time to time and I always leave without really knowing what to do here 😬.

To me, personally, those runbooks aren't useful. In Gitpod, we have a policy that runbooks should be written in a way that the person on-call doesn't have to think at all. Just copy-paste commands and follow different flows depending on what those commands return.

We find that approach the least error-prone when getting paged at super inconvenient times, and it also makes it easier to implement some sort of auto-remediation.
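To make the "run a command, branch on what it returns" idea concrete, here is a minimal sketch of a runbook modelled as a small decision flow. It is only an illustration, not Gitpod's tooling or anything shipped in this repository: the `Step` and `walk` names, the step names, the kubectl commands (which assume the default kube-prometheus namespace `monitoring` and object names `alertmanager-main-0` / `alertmanager-main`), and the string checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One runbook step: a command plus where to go next based on its output."""
    command: str                         # what the on-call person would copy-paste
    classify: Callable[[str], str]       # maps command output to an outcome label
    next_step: Dict[str, Optional[str]]  # outcome label -> next step name, None = stop


def walk(
    steps: Dict[str, Step],
    start: str,
    run: Callable[[str], str],
    max_steps: int = 20,
) -> List[Tuple[str, str]]:
    """Follow the flow from `start`; return the (step name, outcome) pairs visited."""
    trail: List[Tuple[str, str]] = []
    current: Optional[str] = start
    # max_steps guards against a flow that accidentally loops forever.
    while current is not None and len(trail) < max_steps:
        step = steps[current]
        outcome = step.classify(run(step.command))
        trail.append((current, outcome))
        current = step.next_step.get(outcome)
    return trail


# Hypothetical flow: commands and string matches are placeholders, not a
# verified procedure for any real alert.
STEPS = {
    "check_logs": Step(
        command="kubectl -n monitoring logs alertmanager-main-0",
        classify=lambda out: "config_error" if "error" in out.lower() else "clean",
        next_step={"config_error": "show_config", "clean": None},
    ),
    "show_config": Step(
        command="kubectl -n monitoring get secret alertmanager-main -o yaml",
        classify=lambda out: "found" if out.strip() else "missing",
        next_step={"found": None, "missing": None},
    ),
}
```

Because `run` is passed in, the same flow can be walked by a human (a function that prints the command and reads back pasted output), by a test with canned output, or by an automated runner that shells out, which is where auto-remediation could hook in.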

Some runbooks here are not that straightforward, e.g. AlertmanagerFailedReload says "Check logs".


There are some other runbooks that, even though they are super well written (the etcd runbooks), I don't have the knowledge to judge for correctness, nor do I have an environment where I could trigger that incident and follow the runbook to check whether it really resolves the issue.

That being said, I don't want to be a blocker if you think those are good additions 🙂. I just don't know how to proceed here.

paulfantom commented 2 years ago

I think of those runbooks more as documentation that helps you during a crisis when you don't have anything else. A lot of folks installing kube-prometheus struggle with the number of shipped alerts and don't really know what to do when something fires. They want to know what the alert means, what its impact is, and what steps they could take to solve it, which means they literally want a runbook. And even though we cannot always define which particular commands need to be run (because of the myriad deployment scenarios and environments), we can still throw a lifebuoy and show people what is going on when the alert fires. This knowledge can later be used to implement internal runbooks that are better aligned with the particular environment, and maybe to improve the ones we ship here.

> There are some other runbooks that, even though they are super well written (the etcd runbooks), I don't have the knowledge to judge for correctness, nor do I have an environment where I could trigger that incident and follow the runbook to check whether it really resolves the issue.

Turn this around and you will be in exactly the position of many kube-prometheus users. What would you do if an etcd alert fired and you didn't have an internal runbook for it?

> Some runbooks here are not that straightforward, e.g. AlertmanagerFailedReload says "Check logs".

Yes, I am aware, but at the same time, I don't have anything else that could be written here right now. My hope is that with time we can improve those parts and be more specific (maybe write what we are looking for in logs, etc.)