open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.06k stars 2.36k forks source link

flaky test: fileconsumer TestSymlinkedFiles on darwin+arm64 #34494

Open codeboten opened 2 months ago

codeboten commented 2 months ago

Component(s)

pkg/stanza/fileconsumer

Describe the issue you're reporting

See https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/10290446477/job/28480438927 for more details:

=== Failed
=== FAIL: fileconsumer TestSymlinkedFiles (3.60s)
    sink.go:118: 
            Error Trace:    /Users/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/emittest/sink.go:118
                                        /Users/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/file_test.go:369
make[2]: *** [test] Error 1
            Error:          timeout: expected: 3000, actual: 2600
            Test:           TestSymlinkedFiles

=== FAIL: fileconsumer TestSymlinkedFiles (re-run 1) (3.59s)
    sink.go:118: 
            Error Trace:    /Users/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/internal/emittest/sink.go:118
make[1]: *** [pkg/stanza] Error 2
                                        /Users/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/file_test.go:369
            Error:          timeout: expected: 3000, actual: 2900
            Test:           TestSymlinkedFiles

DONE 2 runs, 3285 tests, 2 failures in 74.502s
make: *** [gotest] Error 2
github-actions[bot] commented 2 months ago

Pinging code owners:

ChrsMark commented 2 months ago

I think that the issue might lie on the sink's buffer size at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/67e92938376e5dc6c200899f7545c5cb7aec95e0/pkg/stanza/fileconsumer/file_test.go#L351. I was not able to reproduce it as is locally (on linux) but by decreasing the buffer size a little bit I saw that failing. However, I'm not sure how accurately this can be verified 🤔 Maybe just increasing the size of the buffer and see if flakiness disappear might do the work here:

diff --git a/pkg/stanza/fileconsumer/file_test.go b/pkg/stanza/fileconsumer/file_test.go
index 97e6fe8415..e2e8dd7406 100644
--- a/pkg/stanza/fileconsumer/file_test.go
+++ b/pkg/stanza/fileconsumer/file_test.go
@@ -348,7 +348,8 @@ func TestSymlinkedFiles(t *testing.T) {
        cfg := NewConfig().includeDir(targetTempDir)
        cfg.StartAt = "beginning"
        cfg.PollInterval = pollInterval
-       sink := emittest.NewSink(emittest.WithCallBuffer(numFiles * logLinesPerFile))
+       sink := emittest.NewSink(emittest.WithCallBuffer(numFiles * logLinesPerFile * 1.5))
        operator := testManagerWithSink(t, cfg, sink)

        require.NoError(t, operator.Start(testutil.NewUnscopedMockPersister()))

@djaglowski any other detail that might has slipped here?

djaglowski commented 2 months ago

I'm able to reproduce it locally on mac M1.

The sink buffer size should be fine as it is large enough to contain all 3000 lines. I tested increasing it in case and it still fails.

One thing I see is that the comment // The sleep time here must be larger than the poll_interval value does not appear to be followed correctly. Still trying to understand how the test was meant to work but increasing the sleep time does seem to fix it.

ChrsMark commented 2 months ago

One thing I see is that the comment // The sleep time here must be larger than the poll_interval value does not appear to be followed correctly. Still trying to understand how the test was meant to work but increasing the sleep time does seem to fix it.

The idea of the test (originally) is that the symlinks switch fast while the consumer still achieves consuming all of their contents.

The sleep time must be enough to give time to the consumer to get to the next poll, match the file, and consume it before the symlink has switched. The sleep time is pollInterval + 1*time.Millisecond so it's actually bigger than the poll interval by 1ms, no?

djaglowski commented 2 months ago

Yeah, you're right. I missed that it's a computed value.

Anyways, we seem to be dropping logs regularly when the symlinks are rotated. I am still looking into this.

djaglowski commented 2 months ago

I have not made progress on this issue yet but it's causing problems on main so I've opened #34569 to temporarily skip the test in the meantime.

github-actions[bot] commented 3 weeks ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.