Kubernetes CI Policy: remove egregiously perma-failing jobs

spiffxp commented 4 years ago

Part of https://github.com/kubernetes/test-infra/issues/18551

Why this is important:

Jobs that have been failing for hundreds of days are a drain on community resources.
The fact that they've been failing this long means we've been getting by without their signal, so it's probably more economical to cut our losses than to make diving saves.

http://storage.googleapis.com/k8s-metrics/failures-latest.json provides a list of jobs that have been failing continuously based on results stored in GCS. Note that not everything stored in GCS comes from prow.k8s.io; we allow for federated test results via https://github.com/kubernetes/test-infra/blob/master/kettle/buckets.yaml

Good candidates for removal include:

failing > 365 days
runs on prow.k8s.io but is testing out-of-support releases
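
As a rough illustration of the first criterion, here is a minimal Python sketch that picks out jobs failing for more than 365 days. It assumes the data maps each job name to an object with a `failing_days` field, which matches the excerpt quoted later in this thread. The function name `removal_candidates` and the sample job names are hypothetical, and the sketch does not check whether a job tests an out-of-support release.

```python
import json


def removal_candidates(failures, min_days=365):
    """Return (job, failing_days) pairs failing for more than min_days, longest first."""
    candidates = [
        (job, info["failing_days"])
        for job, info in failures.items()
        # Entries without a failing_days field are treated as not failing.
        if info.get("failing_days", 0) > min_days
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample in the assumed shape; real input would be the parsed
# contents of failures-latest.json.
sample = json.loads("""
{
  "example-job-a": {"failing_days": 1098},
  "example-job-b": {"failing_days": 400},
  "example-job-c": {"failing_days": 12}
}
""")

for job, days in removal_candidates(sample):
    print(f"{job}: failing for {days} days")
```

A list like this is only a starting point for review, since not every long-failing job is a clear-cut removal.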

Make sure to include either @spiffxp or @BenTheElder on PRs for these. Not all of these are clear-cut removals, and we may want to try to find a job owner or otherwise find a way to mitigate.

We should close this issue once we decide what a formal definition of "egregious" is and verify that we've handled everything that meets it. We should then feed whatever we've learned here into a policy of maintaining job health going forward (which is basically the end goal of https://github.com/kubernetes/test-infra/issues/18599 as well).

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten

BenTheElder commented 3 years ago

/remove-lifecycle rotten

spiffxp commented 3 years ago

We still have egregiously perma-failing jobs. For example, here are the top 3 from http://storage.googleapis.com/k8s-metrics/failures-latest.json:

  "ci-kubernetes-node-kubelet-serial": {
    "failing_days": 1098
  },
  "ci-kubernetes-e2enode-ubuntu2-k8sstable3-gkespec": {
    "failing_days": 1021
  },
  "ci-kubernetes-e2e-gci-gce-statefulset": {
    "failing_days": 969
  },

spiffxp commented 3 years ago

https://github.com/kubernetes/test-infra/pull/21141 removed one

Need to refresh where we're at here.

liggitt commented 3 years ago

Jobs that fail 100% of Up or Test are good candidates - https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=%5E(Up%7CTest)%24

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 3 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/18600#issuecomment-892079291):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dims commented 3 years ago

/reopen
/remove-lifecycle rotten

k8s-ci-robot commented 3 years ago

@dims: Reopened this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/18600#issuecomment-892082782):

>/reopen
>/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

spiffxp commented 3 years ago

/milestone v1.23

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

BenTheElder commented 2 years ago

/remove-lifecycle stale
/lifecycle frozen

These jobs aren't going anywhere, and this has to be dealt with someday.

dims commented 2 years ago

xref: https://github.com/kubernetes/kubernetes/issues/109521

dims commented 2 years ago

/assign

Kubernetes CI Policy: remove egregiously perma-failing jobs #18600