/kind flake

Triage link: https://storage.googleapis.com/k8s-triage/index.html?test=TestMetricsPrometheusCollector%2Fover-max-extension-labels

Output from https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_usage-metrics-collector/7/pull-usage-metrics-collector-test/1613685163588325376

I don't think this should be failing!
/triage accepted
Okay, I think I figured it out. On failing runs I'm seeing `/usr/local/go/pkg/tool/linux_amd64/link: signal: killed` in the middle of the test run, which is why we end up with no output. See e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-usage-metrics-collector-test/1615814596579299328/build-log.txt:
```
gotestsum --junitfile /logs/artifacts/junit_20230118-205318.xml
go: downloading github.com/stretchr/testify v1.7.0
go: downloading github.com/pmezard/go-difflib v1.0.0
∅ cmd/container-monitor
∅ cmd/container-monitor/cmd
∅ cmd/metrics-node-sampler
∅ cmd/metrics-node-sampler/cmd
✓ cmd/metrics-node-sampler/integration (11.485s)
∅ cmd/metrics-prometheus-collector
∅ cmd/metrics-prometheus-collector/cmd
/usr/local/go/pkg/tool/linux_amd64/link: signal: killed
✓ pkg/api/samplerserverv1alpha1 (98ms)
✓ pkg/collector (3.071s)
✖ cmd/metrics-prometheus-collector/integration (1m11.239s)
∅ pkg/api
∅ pkg/api/collectorcontrollerv1alpha1
∅ pkg/api/quotamanagementv1alpha1
∅ pkg/collector/api
∅ pkg/collector/utilization
∅ pkg/ctrstats
∅ pkg/log
∅ pkg/sampler
∅ pkg/sampler/api
∅ pkg/scheme
WARN invalid TestEvent: FAIL sigs.k8s.io/usage-metrics-collector/pkg/testutil [build failed]
bad output from test2json: FAIL sigs.k8s.io/usage-metrics-collector/pkg/testutil [build failed]
∅ pkg/version
∅ pkg/watchconfig

=== Failed
=== FAIL: cmd/metrics-prometheus-collector/integration TestMetricsPrometheusCollector/over-max-extension-labels (40.16s)
    integrationutil.go:232:
        Error Trace:    integrationutil.go:232
                        integration_test.go:65
                        testutil.go:114
        Error:          Condition never satisfied
        Test:           TestMetricsPrometheusCollector/over-max-extension-labels
        Messages:       Build info version: dev, commit: none, date: unknown
                        collector config specifies 103 extension labels which exceed the max (100)
                        unable to read collector config
    --- FAIL: TestMetricsPrometheusCollector/over-max-extension-labels (40.16s)

=== FAIL: cmd/metrics-prometheus-collector/integration TestMetricsPrometheusCollector (71.20s)

=== Errors
/usr/local/go/pkg/tool/linux_amd64/link: signal: killed

DONE 103 tests, 2 failures, 1 error in 432.236s
make: *** [Makefile:77: test] Error 1
```
I think the tests are getting OOM-killed mid-build. Let me try to bump the memory available. Since they do pass sometimes, I suspect doubling it should be sufficient.
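The bump would go in the prow job definition in kubernetes/test-infra. A rough sketch of the shape of that change, assuming the job sets explicit resource requests/limits (the job name matches this CI job, but the image and the before/after values here are assumptions, not the actual config):

```yaml
# Hypothetical excerpt of the pull-usage-metrics-collector-test prow job.
# The real definition lives in kubernetes/test-infra; the image and the
# memory numbers below are illustrative only.
- name: pull-usage-metrics-collector-test
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master  # assumed image
      command: ["runner.sh", "make", "test"]
      resources:
        requests:
          memory: "4Gi"  # doubled from an assumed 2Gi
        limits:
          memory: "4Gi"
```

Setting the limit equal to the request keeps the pod in the Guaranteed QoS class, so the linker won't be the first thing the kernel OOM-kills under node pressure.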
/reopen
Unfortunately this still seems to be flaking, but less frequently.
@ehashman: Reopened this issue.
It looks like on successful runs, the entire TestMetricsPrometheusCollector suite runs in <35s. In this case, this single test is timing out at the 40s mark. I think something occasionally causes the condition never to become true (perhaps something failing to start up?), but I'm not sure what.
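For background on the failure mode: "Condition never satisfied" is what testify reports when a polled assertion never returns true before its deadline. A minimal sketch of that pattern, assuming the integration test polls a collector endpoint with roughly a 40s budget (the endpoint, timeout, and tick are illustrative, not the actual values behind testutil.go:114):

```go
package integration_test

import (
	"net/http"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// TestCollectorEventuallyServes illustrates the polling pattern behind
// "Condition never satisfied": require.Eventually re-runs the condition
// every tick until it returns true or waitFor elapses, then fails.
func TestCollectorEventuallyServes(t *testing.T) {
	require.Eventually(t, func() bool {
		// Hypothetical readiness probe against a locally started collector.
		resp, err := http.Get("http://localhost:8080/metrics")
		if err != nil {
			return false // collector not up yet; keep polling
		}
		resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	}, 40*time.Second, 1*time.Second)
}
```

If the collector process occasionally fails to come up at all (e.g. a slow or OOM-pressured node), the condition stays false for the full 40s and the test fails exactly like the trace above.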
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
Unclear if this is still relevant. We can open new issues if it recurs.
/close
@dashpole: Closing this issue.
Looks like this one is no longer flaking.