kubernetes-sigs / usage-metrics-collector

High fidelity and scalable capacity and usage metrics for Kubernetes clusters
Apache License 2.0

[Flake] TestMetricsPrometheusCollector/over-max-extension-labels #8

Closed ehashman closed 4 months ago

ehashman commented 1 year ago

/kind flake

Triage link

https://storage.googleapis.com/k8s-triage/index.html?test=TestMetricsPrometheusCollector%2Fover-max-extension-labels

Output

=== RUN   TestMetricsPrometheusCollector/over-max-extension-labels
    integrationutil.go:232: 
            Error Trace:    integrationutil.go:232
                                        integration_test.go:65
                                        testutil.go:114
            Error:          Condition never satisfied
            Test:           TestMetricsPrometheusCollector/over-max-extension-labels
            Messages:       Build info version: dev, commit: none, date: unknown
                            collector config specifies 103 extension labels which exceed the max (100) unable to read collector config
--- FAIL: TestMetricsPrometheusCollector (31.93s)
    --- PASS: TestMetricsPrometheusCollector/basic (10.21s)
    --- FAIL: TestMetricsPrometheusCollector/over-max-extension-labels (10.03s)
FAIL
FAIL    sigs.k8s.io/usage-metrics-collector/cmd/metrics-prometheus-collector/integration    31.988s

from https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_usage-metrics-collector/7/pull-usage-metrics-collector-test/1613685163588325376

I don't think this should be failing!

ehashman commented 1 year ago

/triage accepted

Okay, I think I figured it out. On failing runs I'm seeing /usr/local/go/pkg/tool/linux_amd64/link: signal: killed in the middle of the test run, which is why we end up with no output. See e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-usage-metrics-collector-test/1615814596579299328/build-log.txt

gotestsum --junitfile /logs/artifacts/junit_20230118-205318.xml
go: downloading github.com/stretchr/testify v1.7.0
go: downloading github.com/pmezard/go-difflib v1.0.0
∅  cmd/container-monitor
∅  cmd/container-monitor/cmd
∅  cmd/metrics-node-sampler
∅  cmd/metrics-node-sampler/cmd
✓  cmd/metrics-node-sampler/integration (11.485s)
∅  cmd/metrics-prometheus-collector
∅  cmd/metrics-prometheus-collector/cmd
/usr/local/go/pkg/tool/linux_amd64/link: signal: killed
✓  pkg/api/samplerserverv1alpha1 (98ms)
✓  pkg/collector (3.071s)
✖  cmd/metrics-prometheus-collector/integration (1m11.239s)
∅  pkg/api
∅  pkg/api/collectorcontrollerv1alpha1
∅  pkg/api/quotamanagementv1alpha1
∅  pkg/collector/api
∅  pkg/collector/utilization
∅  pkg/ctrstats
∅  pkg/log
∅  pkg/sampler
∅  pkg/sampler/api
∅  pkg/scheme
WARN invalid TestEvent: FAIL    sigs.k8s.io/usage-metrics-collector/pkg/testutil [build failed]
bad output from test2json: FAIL sigs.k8s.io/usage-metrics-collector/pkg/testutil [build failed]
∅  pkg/version
∅  pkg/watchconfig

=== Failed
=== FAIL: cmd/metrics-prometheus-collector/integration TestMetricsPrometheusCollector/over-max-extension-labels (40.16s)
    integrationutil.go:232: 
            Error Trace:    integrationutil.go:232
                                        integration_test.go:65
                                        testutil.go:114
            Error:          Condition never satisfied
            Test:           TestMetricsPrometheusCollector/over-max-extension-labels
            Messages:       Build info version: dev, commit: none, date: unknown
                            collector config specifies 103 extension labels which exceed the max (100) unable to read collector config
    --- FAIL: TestMetricsPrometheusCollector/over-max-extension-labels (40.16s)

=== FAIL: cmd/metrics-prometheus-collector/integration TestMetricsPrometheusCollector (71.20s)

=== Errors
/usr/local/go/pkg/tool/linux_amd64/link: signal: killed

DONE 103 tests, 2 failures, 1 error in 432.236s
make: *** [Makefile:77: test] Error 1

I think the tests are getting OOM-killed mid-build. Let me try to bump the memory available. Since they do pass sometimes I suspect doubling it should be sufficient.
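
For anyone triaging similar runs, here is a minimal Go sketch of how one might scan a build log for tools that were terminated by SIGKILL. The helper name killedLines is hypothetical and not part of this repository. It relies only on the fact that Go reports a SIGKILLed child process as "signal: killed". A SIGKILL is consistent with an OOM kill but does not prove one, so treat a hit as a hint rather than a diagnosis.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// killedLines returns every line of a build log that reports a process
// terminated by SIGKILL, which Go renders as "signal: killed".
// Hypothetical helper for illustration; not part of usage-metrics-collector.
func killedLines(log string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasSuffix(line, "signal: killed") {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Excerpt copied from the build log quoted above.
	log := `∅  cmd/metrics-prometheus-collector/cmd
/usr/local/go/pkg/tool/linux_amd64/link: signal: killed
✓  pkg/collector (3.071s)`

	for _, l := range killedLines(log) {
		fmt.Println(l)
	}
}
```

If the linker shows up in the result, the test binary for at least one package was never built, which would explain a "[build failed]" entry in the same run.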

ehashman commented 1 year ago

/reopen

unfortunately still seems to be flaking, but less frequently

k8s-ci-robot commented 1 year ago

@ehashman: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/usage-metrics-collector/issues/8#issuecomment-1412485366):

>/reopen
>
>unfortunately still seems to be flaking, but less frequently

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ehashman commented 1 year ago

It looks like on successful runs, the entire TestMetricsPrometheusCollector suite runs in <35s. In this case, this single test is timing out at the 40s mark. I think something occasionally causes the condition to never be true (something failing to start up?) but not sure what.

k8s-triage-robot commented 5 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

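You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close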

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

dashpole commented 4 months ago

unclear if this is still relevant. We can open new issues if it reoccurs.

/close

k8s-ci-robot commented 4 months ago

@dashpole: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/usage-metrics-collector/issues/8#issuecomment-1984125357):

>unclear if this is still relevant. We can open new issues if it reoccurs.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ehashman commented 4 months ago

Looks like this one is no longer flaking.