kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

triage is bailing out on some failure clusters #13730

Closed: spiffxp closed this issue 4 years ago

spiffxp commented 5 years ago

What happened: Not seeing as many `TestVolumeProvision` failures listed in go.k8s.io/triage as we would expect

A look at the logs from [a recent triage run](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-test-infra-triage/1156947678471393281/build-log.txt) reveals:

I0801 15:22:32.167] Clustering failures for 3393 unique tests...
#...
I0801 15:24:28.293]  245/3393, 141 failures, verify vendor-licenses
I0801 15:25:28.605] bailing early, taking too long!
#...
I0801 15:27:07.843] 1935/3393, 16 failures, k8s.io/kubernetes/test/integration/scheduler TestVolumeProvision
I0801 15:28:13.154] bailing early, taking too long!

During the next phase:

I0801 15:29:35.407] Combining clustered failures for 3393 unique tests...
# ...
I0801 15:33:19.266]  902/3393, 51 clusters, verify vendor-licenses
# ...
I0801 15:40:01.305] 1955/3393, 14 clusters, k8s.io/kubernetes/test/integration/scheduler TestVolumeProvision

What you expected to happen: To see `TestVolumeProvision` listed in go.k8s.io/triage

How to reproduce it (as minimally and precisely as possible): Try running `triage/update-summaries.sh` (in its current state, I would recommend commenting out the parts that write to buckets)

Please provide links to example occurrences, if any: See above

Anything else we need to know?:

My guess is that there is something common between those two failures, and it's probably large failure text. If so, we could address it either by truncating failure text more aggressively or by increasing the timeout before bailing.
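For illustration, here is a minimal sketch of how a time-budgeted clustering loop could expose those two knobs. The constant values, the `truncate` helper, and the `find_match` stand-in are assumptions for the example, not the actual code in `triage/summarize.py` (which uses edit-distance matching):

```python
import time

# Illustrative sketch only; the constants below are placeholders, not the tool's real values.
CLUSTER_TIME_BUDGET_SECONDS = 60    # "increase the timeout before bailing" => raise this
MAX_FAILURE_TEXT_LEN = 10000        # "truncate failure text more aggressively" => lower this


def truncate(text, max_len=MAX_FAILURE_TEXT_LEN):
    """Keep the head and tail of oversized failure text so matching stays cheap."""
    if len(text) <= max_len:
        return text
    half = max_len // 2
    return text[:half] + '\n...[truncated]...\n' + text[-half:]


def find_match(text, clusters):
    """Hypothetical stand-in for the real similarity lookup (exact match only here)."""
    return text if text in clusters else None


def cluster_test(failures):
    """Greedily cluster failure texts, bailing once the per-test time budget is spent."""
    clusters = {}
    start = time.time()
    for failure in failures:
        text = truncate(failure['failure_text'])
        key = find_match(text, clusters) or text
        clusters.setdefault(key, []).append(failure)
        if time.time() > start + CLUSTER_TIME_BUDGET_SECONDS:
            print('bailing early, taking too long!')
            break
    return clusters
```

With very large failure texts, each similarity comparison gets expensive, so a test can blow through the budget after only a handful of failures; that would match the log lines above where only a few failures are clustered before the bail message.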

/help
/area triage
/priority important-longterm

k8s-ci-robot commented 5 years ago

@spiffxp: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes/test-infra/issues/13730):

> *(the issue body above, quoted in full)*

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
spiffxp commented 5 years ago

FYI @liggitt @BenTheElder since this came up in #sig-testing (ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1564674490149000)

BenTheElder commented 5 years ago

Took a quick poke and noted that python3 is being used for these files under bazel, but the image is pypy2 and probably old. Will look at updating that while we're at it...

BenTheElder commented 5 years ago

semi-related: https://github.com/kubernetes/test-infra/pull/13732

BenTheElder commented 5 years ago

triage is now broken entirely with https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1157331701362331652

+ bq extract --compression GZIP --destination_format NEWLINE_DELIMITED_JSON k8s-gubernator:temp.triage gs://k8s-gubernator/triage_tests.json.gz
BigQuery error in extract operation: Error processing job
'k8s-gubernator:bqjob_r7ddb74232fc611d3_0000016c533a4156_1': Table
gs://k8s-gubernator/triage_tests.json.gz too large to be exported to a single
file. Specify a uri including a * to shard export. See 'Exporting data into one
or more files' in https://cloud.google.com/bigquery/docs/exporting-data.
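The error message itself points at the remedy: shard the export by putting a `*` in the destination URI so BigQuery can split the output across multiple objects. A hedged sketch of what the corrected command might look like (the exact URI naming in the fix that landed may differ):

```sh
# Sharded export sketch; only the destination URI changes from the failing command above.
# The "triage_tests-*.json.gz" naming is an assumption for illustration.
bq extract --compression GZIP --destination_format NEWLINE_DELIMITED_JSON \
  k8s-gubernator:temp.triage 'gs://k8s-gubernator/triage_tests-*.json.gz'
```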
spiffxp commented 5 years ago

triage failing entirely was fixed by https://github.com/kubernetes/test-infra/pull/13753

Bailing out is still happening, e.g.: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-test-infra-triage/1158788723303780358/build-log.txt

I0806 17:22:18.564]  286/3584, 137 failures, verify vendor-licenses
I0806 17:23:18.640] bailing early, taking too long!
I0806 17:23:57.304] 1417/3584, 38 failures, k8s.io/kubernetes/test/integration/scheduler TestVolumeProvision
I0806 17:25:06.662] bailing early, taking too long!
I0806 17:25:15.805] 1930/3584, 19 failures, k8s.io/kubernetes/test/integration/scheduler TestRescheduleProvisioning
I0806 17:26:21.639] bailing early, taking too long!
I0806 17:26:30.919] 2245/3584, 8 failures, k8s.io/kubernetes/test/integration/scheduler TestPVAffinityConflict
I0806 17:27:34.862] bailing early, taking too long!
fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/13730#issuecomment-570702496):

> *(the lifecycle comment above, quoted in full)*

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.