open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.12k stars 2.39k forks source link

[receiver/prometheus] Flaky test TestLabelValueLimitConfig #11753

Closed dmitryax closed 2 years ago

dmitryax commented 2 years ago
=== RUN   TestLabelValueLimitConfig
=== RUN   TestLabelValueLimitConfig/target1
    metrics_receiver_helper_test.go:170: 
            Error Trace:    metrics_receiver_helper_test.go:170
                                        metrics_receiver_labels_test.go:172
                                        metrics_receiver_helper_test.go:608
            Error:          "1" is not less than or equal to "0"
            Test:           TestLabelValueLimitConfig/target1
            Messages:       want at least 1 valid scrapes, but got 0
=== RUN   TestLabelValueLimitConfig/target2
--- FAIL: TestLabelValueLimitConfig (5.30s)
    --- FAIL: TestLabelValueLimitConfig/target1 (0.00s)
    --- PASS: TestLabelValueLimitConfig/target2 (0.00s)

Seen in https://github.com/open-telemetry/opentelemetry-collector-contrib/runs/7058818357?check_suite_focus=true

jspaleta commented 2 years ago

@djaglowski I think all the prom receiver flaky tests are symptoms of a race condition in how the commonly used testComponent func and the http server routine associated with mockPrometheus object operates for all the tests.

Can you assign me to this and the other open flaky prom receiver tests? I think the fix is similar to what i did for the carbon receiver, need to put in a timeout,tick wait loop before calling testComponent in every flaky test.
Maybe there's a way to abstract it so every test gets this automatically as part of how testComponent operates so it.

jspaleta commented 2 years ago

Okay I think I found it. There may be a wait group being used incorrectly in the Mock Prometheus server, with the Done call being done in a racy way. I'm going to try to sythetically produce a race locally and then fix.

djaglowski commented 2 years ago

Thanks for looking into these flaky tests! I've assigned this one to you. If there are others you think are related, please call them out specifically and I can assign those as well.

jspaleta commented 2 years ago

Okay I can reproduce the problem locally quite easily with a one line change in the tests.

Here's the reproducer change: I synthetically slow down the ServeHTTP function by introducing a short time.Sleep(500 * time.Millisecond) into the function:

func (mp *mockPrometheus) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
        mp.mu.Lock()
        defer mp.mu.Unlock()
        time.Sleep(500 * time.Millisecond)
...

With this change I can reproduce the problem locally on every run, with every test marked flaky.

Not only that but the test http server serving up the metrics fails to close properly. the Close() call is blocked by what I assume is active receiver connections. But all the tests are configured to close the receiver as part of T.Cleanup() which only happens well after the server Close()

go test -v -run TestLabelValueLimitConfig
=== RUN   TestLabelValueLimitConfig
=== RUN   TestLabelValueLimitConfig/target1
    metrics_receiver_helper_test.go:205: 
            Error Trace:    /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_helper_test.go:205
                                        /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_labels_test.go:171
                                        /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_helper_test.go:643
            Error:          "1" is not less than or equal to "0"
            Test:           TestLabelValueLimitConfig/target1
            Messages:       want at least 1 valid scrapes, but got 0
=== RUN   TestLabelValueLimitConfig/target2
2022/08/04 22:39:15 httptest.Server blocked in Close after 5 seconds, waiting for connections:
  *net.TCPConn 0xc00071ee30 127.0.0.1:42102 in state active
  *net.TCPConn 0xc00071ea18 127.0.0.1:41972 in state active
  *net.TCPConn 0xc0006145c8 127.0.0.1:42018 in state active
  *net.TCPConn 0xc000614460 127.0.0.1:41956 in state active
  *net.TCPConn 0xc000712860 127.0.0.1:41900 in state active
  *net.TCPConn 0xc000cda030 127.0.0.1:41990 in state active
  *net.TCPConn 0xc0006145d0 127.0.0.1:42032 in state active
  *net.TCPConn 0xc000131c18 127.0.0.1:42076 in state active
  *net.TCPConn 0xc000712868 127.0.0.1:41904 in state active
  *net.TCPConn 0xc000614300 127.0.0.1:41920 in state active
  *net.TCPConn 0xc000131c10 127.0.0.1:42070 in state active
  *net.TCPConn 0xc000010308 127.0.0.1:41930 in state active
  *net.TCPConn 0xc00071ea08 127.0.0.1:41936 in state active
  *net.TCPConn 0xc00071ea10 127.0.0.1:41966 in state active
  *net.TCPConn 0xc00071ee40 127.0.0.1:42118 in state active
  *net.TCPConn 0xc000712dd0 127.0.0.1:42082 in state active
  *net.TCPConn 0xc000cda020 127.0.0.1:41944 in state active
  *net.TCPConn 0xc000cda198 127.0.0.1:42052 in state active
  *net.TCPConn 0xc000cda1a0 127.0.0.1:42062 in state active
  *net.TCPConn 0xc000cda190 127.0.0.1:42040 in state active
  *net.TCPConn 0xc000cda028 127.0.0.1:41988 in state active
  *net.TCPConn 0xc00071eb78 127.0.0.1:42006 in state active
  *net.TCPConn 0xc0006145c0 127.0.0.1:42012 in state active
  *net.TCPConn 0xc000131858 127.0.0.1:42014 in state active
  *net.TCPConn 0xc000131128 127.0.0.1:41958 in state active
  *net.TCPConn 0xc00071ee28 127.0.0.1:42090 in state active
  *net.TCPConn 0xc0006149d8 127.0.0.1:42108 in state active
--- FAIL: TestLabelValueLimitConfig (25.11s)
    --- FAIL: TestLabelValueLimitConfig/target1 (0.00s)
    --- PASS: TestLabelValueLimitConfig/target2 (0.00s)

That one change to add a 500 msec sleep is enough to cause the validation functions defined to fail consistently for me in all the flaky tests.

This isn't a simple fix. Someone who understands the details of the validation functions needs to look at this again. That additional delay in the http server is enough to invalidate the reciever scrapes.. and its not clear why...and that shouldn't be happening. a 500 msec delay isn't unrealistic.

There are scrape requests coming in, the waitgroup logic is working correctly to make sure enough requests make it through the ServerHTTP function before validation functions fire. So its not a race.

But those scrapes are just "invalid" accordingly to the particular validation logic in use...and its not clear why.

Is there an undocumented http request timeouts in play here in how the receiver functions that the 500 msec delay is tripping?

The fact that a simple additional time.Wait() in the mock http server serving up metrics to scrape causes this to be reproducible suggests to me there is a real bug in the receiver behavior that needs to be sorted out. I really shouldn't be able to have the tests fall over like this with a simple sleep in the http request handling.

jspaleta commented 2 years ago

As promised, here a commented branch with the reproducer sleep in place and including additional diagnostic logging to try to understand what is going on with.

https://github.com/jspaleta/opentelemetry-collector-contrib/tree/js/diagnose-flaky-prom-tests

Someone who has a better understanding of how the receiver's http connections work should take a look a this. I really get the feeling the receiver has an http connection timeout that is too tight, but I'm not sure how to increase the allow request response time budget.

Here's output of a local run of this branch What's important to note is that my new waitForScrapeResults function that tries to wait for the receiver to have done sufficiently enough scrapes is returning before the ServerHTTP even fully processes the first request. So that tells me the receiver isn't actually waiting long enough for the http server its scraping to respond.

This feels like a bug.. but if its not wee need a better mechanism in the testing framework to tell if the receiver actually got enough of the right kind of scrapes. Because whatever the receiver is doing right now.. it doesn't look like its waiting for the server to respond.

go test -v -run TestLabelValueLimitConfig
=== RUN   TestLabelValueLimitConfig
start waitForScrapeResults
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 0 scrapes and wants 1
aborting waitForSscapeResults insufficient scrapes for target: target1
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
waitForSscapeResults tick start
waitForSscapeResults target: target1 has 4 scrapes and wants 1
waitForSscapeResults target: target2 has 4 scrapes and wants 1
call blocking waitgroup Wait()
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 0 page code: 200
ServerHTTP:: response Status: 200
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 0 page code: 200
ServerHTTP:: response Status: 200
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 1 page code: N/A
ServerHTTP index >= pages for endpoint /target2/metrics, call waitgroup Done()
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 1 page code: N/A
ServerHTTP index >= pages for endpoint /target1/metrics, call waitgroup Done()
ServerHTTP:: response Status: 404
waitgroup Wait() finished
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
=== RUN   TestLabelValueLimitConfig/target1
    metrics_receiver_helper_test.go:222: 
            Error Trace:    /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_helper_test.go:222
                                        /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_labels_test.go:171
                                        /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_helper_test.go:666
            Error:          "1" is not less than or equal to "0"
            Test:           TestLabelValueLimitConfig/target1
            Messages:       want at least 1 valid scrapes, but got 0
=== RUN   TestLabelValueLimitConfig/target2
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 2 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 2 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 3 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 3 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 4 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 4 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 5 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 5 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 6 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
2022/08/05 12:03:38 httptest.Server blocked in Close after 5 seconds, waiting for connections:
  *net.TCPConn 0xc00048a6d8 127.0.0.1:53352 in state active
  *net.TCPConn 0xc0006209e8 127.0.0.1:53386 in state active
  *net.TCPConn 0xc00048a840 127.0.0.1:53426 in state active
  *net.TCPConn 0xc00071c9e8 127.0.0.1:53446 in state active
  *net.TCPConn 0xc00057ed88 127.0.0.1:53262 in state active
  *net.TCPConn 0xc00052b0a8 127.0.0.1:53298 in state active
  *net.TCPConn 0xc00057ead8 127.0.0.1:53224 in state active
  *net.TCPConn 0xc00048a830 127.0.0.1:53384 in state active
  *net.TCPConn 0xc00057ed90 127.0.0.1:53272 in state active
  *net.TCPConn 0xc00071c300 127.0.0.1:53210 in state active
  *net.TCPConn 0xc00057eee8 127.0.0.1:53284 in state active
  *net.TCPConn 0xc0006209e0 127.0.0.1:53254 in state active
  *net.TCPConn 0xc00071c728 127.0.0.1:53328 in state active
  *net.TCPConn 0xc00052b218 127.0.0.1:53368 in state active
  *net.TCPConn 0xc00048a848 127.0.0.1:53438 in state active
  *net.TCPConn 0xc00071c308 127.0.0.1:53220 in state active
  *net.TCPConn 0xc000620b48 127.0.0.1:53406 in state active
  *net.TCPConn 0xc000620880 127.0.0.1:53230 in state active
  *net.TCPConn 0xc00057eef0 127.0.0.1:53316 in state active
  *net.TCPConn 0xc0006209d8 127.0.0.1:53240 in state active
  *net.TCPConn 0xc00057f1a8 127.0.0.1:53372 in state active
  *net.TCPConn 0xc00071c730 127.0.0.1:53398 in state active
  *net.TCPConn 0xc000620b50 127.0.0.1:53410 in state active
  *net.TCPConn 0xc00052b0b0 127.0.0.1:53302 in state active
  *net.TCPConn 0xc00052b0b8 127.0.0.1:53340 in state active
  *net.TCPConn 0xc00048a838 127.0.0.1:53402 in state active
  *net.TCPConn 0xc0006205c8 127.0.0.1:53208 in state active
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 6 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 7 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 7 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 8 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 8 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 9 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 9 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 10 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 10 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 11 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 11 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 12 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 12 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 13 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 13 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 14 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 14 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 15 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 15 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 16 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 16 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 17 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 17 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 18 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 18 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 19 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 19 page code: N/A
ServerHTTP:: response Status: 404
--- FAIL: TestLabelValueLimitConfig (25.07s)
    --- FAIL: TestLabelValueLimitConfig/target1 (0.00s)
    --- PASS: TestLabelValueLimitConfig/target2 (0.00s)
FAIL
exit status 1
FAIL    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver   25.108s
jspaleta commented 2 years ago

Okay I refactor the new waitFor function to use the Server's pov about when it thinks its served enough requests to let the tests run.

The diagnostic order of operations look better but there is still a problem,

== RUN   TestLabelValueLimitConfig
start waitForScrapeResults
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
waitForSscapeResults tick start targets:2 served:0
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
waitForSscapeResults tick start targets:2 served:0
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 0 page code: 200
ServerHTTP:: response Status: 200
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
waitForSscapeResults tick start targets:2 served:0
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 0 page code: 200
ServerHTTP:: response Status: 200
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
waitForSscapeResults tick start targets:2 served:0
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 1 page code: N/A
ServerHTTP index >= pages for endpoint /target2/metrics, call waitgroup Done()
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
waitForSscapeResults tick start targets:2 served:1
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 1 page code: N/A
ServerHTTP index >= pages for endpoint /target1/metrics, call waitgroup Done()
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
waitForSscapeResults tick start targets:2 served:2
call blocking waitgroup Wait()
waitgroup Wait() finished
=== RUN   TestLabelValueLimitConfig/target1
    metrics_receiver_helper_test.go:203: 
            Error Trace:    /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_helper_test.go:203
                                        /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_labels_test.go:171
                                        /home/jspaleta/sources/orgs/jspaleta/opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_helper_test.go:647
            Error:          "1" is not less than or equal to "0"
            Test:           TestLabelValueLimitConfig/target1
            Messages:       want at least 1 valid scrapes, but got 0
=== RUN   TestLabelValueLimitConfig/target2
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 2 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 2 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 3 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 3 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 4 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 4 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 5 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 5 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 6 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 6 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
2022/08/05 12:29:31 httptest.Server blocked in Close after 5 seconds, waiting for connections:
  *net.TCPConn 0xc000284f48 127.0.0.1:47750 in state active
  *net.TCPConn 0xc000118dc0 127.0.0.1:47636 in state active
  *net.TCPConn 0xc0001183f8 127.0.0.1:47616 in state active
  *net.TCPConn 0xc0006125a8 127.0.0.1:47724 in state active
  *net.TCPConn 0xc000564758 127.0.0.1:47520 in state active
  *net.TCPConn 0xc0005648c0 127.0.0.1:47562 in state active
  *net.TCPConn 0xc000118fa8 127.0.0.1:47722 in state active
  *net.TCPConn 0xc0001193f0 127.0.0.1:47810 in state active
  *net.TCPConn 0xc000612868 127.0.0.1:47830 in state active
  *net.TCPConn 0xc000010ba0 127.0.0.1:47536 in state active
  *net.TCPConn 0xc000284de8 127.0.0.1:47694 in state active
  *net.TCPConn 0xc000284f50 127.0.0.1:47774 in state active
  *net.TCPConn 0xc0006a3740 127.0.0.1:47822 in state active
  *net.TCPConn 0xc0006a2578 127.0.0.1:47518 in state active
  *net.TCPConn 0xc000564a28 127.0.0.1:47620 in state active
  *net.TCPConn 0xc000284de0 127.0.0.1:47678 in state active
  *net.TCPConn 0xc0006a2998 127.0.0.1:47600 in state active
  *net.TCPConn 0xc0006a2db8 127.0.0.1:47658 in state active
  *net.TCPConn 0xc000284dd8 127.0.0.1:47670 in state active
  *net.TCPConn 0xc0006a2f18 127.0.0.1:47706 in state active
  *net.TCPConn 0xc0005648b8 127.0.0.1:47526 in state active
  *net.TCPConn 0xc000284878 127.0.0.1:47548 in state active
  *net.TCPConn 0xc0006a2830 127.0.0.1:47578 in state active
  *net.TCPConn 0xc000564cf0 127.0.0.1:47824 in state active
  *net.TCPConn 0xc000612448 127.0.0.1:47710 in state active
  *net.TCPConn 0xc0006a31d0 127.0.0.1:47764 in state active
  *net.TCPConn 0xc000564b90 127.0.0.1:47794 in state active
  *net.TCPConn 0xc0006125b8 127.0.0.1:47820 in state active
  *net.TCPConn 0xc0006a35e0 127.0.0.1:47802 in state active
  *net.TCPConn 0xc0006a2c58 127.0.0.1:47648 in state active
  *net.TCPConn 0xc000118dc8 127.0.0.1:47650 in state active
  *net.TCPConn 0xc0006125b0 127.0.0.1:47740 in state active
  *net.TCPConn 0xc0006a2c50 127.0.0.1:47632 in state active
  *net.TCPConn 0xc000564b88 127.0.0.1:47782 in state active
  *net.TCPConn 0xc0006a2838 127.0.0.1:47582 in state active
  *net.TCPConn 0xc0005648c8 127.0.0.1:47586 in state active
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 7 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 7 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 8 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 8 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 9 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 9 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 10 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 10 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 11 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 11 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 12 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 12 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 13 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 13 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 14 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 14 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 15 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 15 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 16 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 16 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 17 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 17 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 18 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 18 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 19 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 19 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 20 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 20 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 21 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 21 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 22 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 22 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 23 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 23 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target2/metrics pages: 1 index: 24 page code: N/A
ServerHTTP:: response Status: 404
ServerHTTP:: received a request, lets sleep for 500 msecs to ensure there is a test failure
ServerHTTP:: req endpoint /target1/metrics pages: 1 index: 24 page code: N/A
ServerHTTP:: response Status: 404
--- FAIL: TestLabelValueLimitConfig (30.07s)
    --- FAIL: TestLabelValueLimitConfig/target1 (0.00s)
    --- PASS: TestLabelValueLimitConfig/target2 (0.00s)
FAIL
exit status 1
FAIL    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver   30.096s
jspaleta commented 2 years ago

Okay good news.

This looks like config that can be tuned. there is a config called scrape_interval that is too tight.. this needs to be increased for the test situation.

I just don't know how to set that...yet. Here's the promethus config being passed to the ScrapeManager object.

global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
scrape_configs:
- job_name: target1
  honor_timestamps: true
  scrape_interval: 100ms
  scrape_timeout: 100ms    
  metrics_path: /target1/metrics
  scheme: http
  label_value_length_limit: 25
  follow_redirects: true
  enable_http2: true
  static_configs:
  - targets:
    - 127.0.0.1:38107
- job_name: target2
  honor_timestamps: true
  scrape_interval: 100ms
  scrape_timeout: 100ms
  metrics_path: /target2/metrics
  scheme: http
  label_value_length_limit: 25
  follow_redirects: true
  enable_http2: true
  static_configs:
  - targets:
    - 127.0.0.1:38107
jspaleta commented 2 years ago

Okay I think I have this sorted out now. I'm going to prep a PR with the changes to scrape_interval and scrape_timeout so they allow for some slowness in the production server tests

jspaleta commented 2 years ago

@djaglowski PR #12986 up with proposed fix

djaglowski commented 2 years ago

Closing based on #12986