kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

Alert for boskos pool #9383

Closed krzyzacy closed 4 years ago

krzyzacy commented 6 years ago

We should alert when the main pool (gce, gke) volume drops below ~25%. I'll poke around when I have time.

cc @BenTheElder
/area boskos
/assign
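A minimal sketch of the check proposed above, assuming "volume" means the fraction of resources in a pool that are in the `free` state. The pool names, the sample counts, and the `low_pools` helper are hypothetical and only illustrate the ~25% threshold; this is not how Boskos or its monitoring implements the alert.

```python
from typing import Dict, List

# Assumed threshold, taken from the "~25%" figure in the comment above.
FREE_THRESHOLD = 0.25


def low_pools(counts: Dict[str, Dict[str, int]],
              threshold: float = FREE_THRESHOLD) -> List[str]:
    """Return the names of pools whose free fraction is below the threshold.

    `counts` maps a pool name to a mapping of resource state -> count.
    """
    alerts = []
    for pool, states in sorted(counts.items()):
        total = sum(states.values())
        if total == 0:
            continue  # an empty pool has no meaningful ratio; skip it
        free_fraction = states.get("free", 0) / total
        if free_fraction < threshold:
            alerts.append(pool)
    return alerts


if __name__ == "__main__":
    # Hypothetical sample data, not real pool sizes.
    sample = {
        "gce-project": {"free": 10, "busy": 80, "dirty": 10},
        "gke-project": {"free": 40, "busy": 55, "dirty": 5},
    }
    print(low_pools(sample))
```

With the sample data, only the hypothetical `gce-project` pool (10% free) falls below the threshold; a real alert would read these counts from whatever metrics the monitoring stack exposes.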

krzyzacy commented 6 years ago

cc @cjwagner: have we ever set up other alerts from velodrome? They support email or Slack, and I'll probably hook this one up to the #testing-ops channel. Do we have a Slack token stored somewhere?

cjwagner commented 6 years ago

Yeah, I believe that Quintin had some alerts configured on velodrome at some point. I'm not sure where they are sent though. There should be a slack token in a secret in the service cluster.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

krzyzacy commented 5 years ago

/remove-lifecycle stale

Will get to that some day...

krzyzacy commented 5 years ago

AFAIK we'll move to prow's monitoring stack

krzyzacy commented 5 years ago

/remove-lifecycle stale
/assign @clarketm
/unassign

ixdy commented 4 years ago

What's left to be done now that Boskos is using the Prow monitoring stack (#15344)?

cjwagner commented 4 years ago

> What's left to be done now that Boskos is using the Prow monitoring stack (#15344)?

Once we have some data from the new metrics, we'll be able to pick an appropriate alert threshold and add Prometheus alerts like the ones defined in this directory: https://github.com/kubernetes/test-infra/tree/master/prow/cluster/monitoring/mixins/prometheus

BenTheElder commented 4 years ago

isn't this done @cjwagner ?

ixdy commented 4 years ago

Yeah, I think we can call this done:
https://github.com/kubernetes/test-infra/blob/47050c4743c0381165543bcc587a3094c2c5c179/prow/cluster/monitoring/mixins/prometheus/boskos_alerts.libsonnet#L4-L28
We even had an alert last week for a resource type we'd deleted but which was still being tracked by Boskos.

/close

k8s-ci-robot commented 4 years ago

@ixdy: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/9383#issuecomment-589230609):

>Yeah, I think we can call this done:
>https://github.com/kubernetes/test-infra/blob/47050c4743c0381165543bcc587a3094c2c5c179/prow/cluster/monitoring/mixins/prometheus/boskos_alerts.libsonnet#L4-L28
>We even had [an alert](https://kubernetes.slack.com/archives/C7J9RP96G/p1581614640143300) last week for a resource type we'd deleted but which was still being tracked by Boskos.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ixdy commented 4 years ago

Also cross-referencing #15412.