Open howardjohn opened 5 years ago
What we have:
Monitoring and alerting of Prow components: https://monitoring.prow.istio.io. These report to the Istio #test-alerts channel (and, once https://github.com/istio/test-infra/pull/2610 goes through, will report critical errors to the #oncall channel).
CPU/Memory usage relative to requests/limits (and other useful node and job metrics), provided by the Stackdriver Prow dashboard. I am experimenting with various alerts here; once the alerts are properly tuned, I will push these to Slack as well.
What we do not have:
It would be useful to have metrics like
A nice-to-have would be seeing this broken down per job, or something similar.
Looking at Stackdriver, it seems we can get the GCE stats of the underlying nodes, but I don't see the Kubernetes metrics there.
It's possible I am just looking in the wrong place and we already have these.