sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

monitoring: improve kubernetes monitoring (pod scheduling, etc) #16309

Closed bobheadxi closed 3 years ago

bobheadxi commented 3 years ago

We recently ran into issues where unschedulable pods caused upgrades to fail silently. @pecigonzalo suggested the following:

I would do it by just checking the unschedulable pod metrics. There are quite a few Kube metrics that I believe we don't alert on but that would provide this information. I would generally implement most of what is in https://monitoring.mixins.dev/kubernetes/ (basically https://github.com/kubernetes-monitoring/kubernetes-mixin) and https://gitlab.com/gitlab-com/runbooks/-/tree/master/ (lots of good monitoring examples and dashboards there). AFAIR there are a few for pod state.

Each dashboard currently has a "Kubernetes monitoring" section that we can expand with this information.
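
For illustration, here is a rough sketch of what an alert on unschedulable pods could look like. It assumes kube-state-metrics is being scraped and exposes the `kube_pod_status_unschedulable` gauge. The alert name, severity label, `for` duration, and the `prod` namespace are all placeholders made up for this example, not anything from our existing monitoring definitions.

```python
import json


def unschedulable_pods_rule(namespace: str, for_duration: str = "5m") -> dict:
    """Build a Prometheus alerting rule (as a plain dict) for unschedulable pods.

    Assumption: kube-state-metrics is scraped and exposes the
    kube_pod_status_unschedulable gauge for pods that cannot be scheduled.
    """
    # Doubled braces in the f-string produce the literal PromQL label matcher braces.
    expr = (
        "sum by (namespace, pod) "
        f'(kube_pod_status_unschedulable{{namespace="{namespace}"}}) > 0'
    )
    return {
        "alert": "PodUnschedulable",  # hypothetical alert name
        "expr": expr,
        "for": for_duration,  # only fire if the pod stays unschedulable this long
        "labels": {"severity": "warning"},  # hypothetical severity
        "annotations": {
            "summary": "Pod {{ $labels.namespace }}/{{ $labels.pod }} cannot be scheduled",
        },
    }


rule = unschedulable_pods_rule("prod")  # "prod" is a placeholder namespace
# Printed as JSON for inspection; a real rule file would normally be written as YAML.
print(json.dumps({"groups": [{"name": "kubernetes-pods", "rules": [rule]}]}, indent=2))
```

The `for` clause is there so that pods which are only briefly pending during a normal rollout do not trigger the alert; the right duration would need tuning against real upgrades.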

bobheadxi commented 3 years ago

@pecigonzalo brought up an awesome resource we can use for this: https://awesome-prometheus-alerts.grep.to/rules#kubernetes