prometheus-operator / runbooks

https://runbooks.prometheus-operator.dev
Apache License 2.0

Request: more in-depth contributing guidelines #9

Closed by mac-chaffee 2 years ago

mac-chaffee commented 2 years ago

I'm happy to have found this repo, since I'm a user of kube-prometheus-stack, which recently updated to link to this runbook by default!

I have created an internal runbook before, so I could provide some extra depth to the documentation. However, I'm a little unsure of the context, goals, and guidelines behind this repo. Maybe we could expand https://runbooks.prometheus-operator.dev/docs/add-runbook/ to include some rules for good contributions?

For example:

Interested to hear your thoughts on what the direction is with this runbook since it looks promising already!

paulfantom commented 2 years ago

We had some discussion about what we want this repo to be in https://github.com/prometheus-operator/runbooks/pull/5

> If this is meant to be the official runbook of prometheus-operator, I'd assume we'd have to make sure the advice given is generic enough to fit the majority of prometheus-operator users, right? One guideline could be "Try to ensure the 'Diagnosis/Mitigation' sections are applicable to all certified k8s distributions" or something.

Yes, you are correct. The purpose of this repository is to provide documentation for every alert shipped by kube-prometheus (not only by prometheus-operator). In the long run, we are aiming to support as many k8s flavors as possible. However, right now we are mostly focusing on improving coverage, as many runbooks are missing.

I like the idea of adding a separate guidelines section to https://github.com/prometheus-operator/runbooks/blob/main/content/docs/add-runbook.md, and we can start with: "If possible, try to ensure the 'Diagnosis/Mitigation' sections are applicable to all certified kubernetes distributions."

However, given the very early stage of the project and the lack of runbooks for many alerts, I don't think we should start gatekeeping. IMHO, in the current state it would be preferable to have full coverage of all alerts first and refine the content later (this includes testing on multiple k8s platforms).

> Maybe a length guideline? If this runbook is meant primarily to be something that panicked sysadmins read at 2 AM, short is better. But my internal runbook is a bit more long-winded since my users are kubernetes novices who need an in-depth description of what "KubePodCrashLooping" really means. Maybe a middle ground would be to add a "click to expand" button that would expand to show a more in-depth description for novice users (that experts could just ignore)?

Right now we are aiming mostly at novices who don't have much insight into what to do with the alerts shipped in kube-prometheus. This is because we found that experienced SREs will have their own custom-written runbooks (potentially based on content from this repo), mostly to include remediation strategies that are specific to their particular environments. That said, I really like the idea of having "click to expand" sections! :+1:

mac-chaffee commented 2 years ago

Awesome, that all makes sense to me! Work is keeping me a little busy right now, but I hope to make some contributions from my internal runbook soon with those guidelines in mind.