grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Flaky TestLabelNames_Cancelled #8428

Closed pracucci closed 2 weeks ago

pracucci commented 2 weeks ago

In this CI run I've seen TestLabelNames_Cancelled being flaky:

--- FAIL: TestLabelNames_Cancelled (0.07s)
    bucket_test.go:2909: Creating 2 1-sample series with 1ms interval in /tmp/TestLabelNames_Cancelled3562416338/001/0
    bucket_test.go:2909: Creating 2 1-sample series with 1ms interval in /tmp/TestLabelNames_Cancelled3562416338/001/1
    testing.go:1231: TempDir RemoveAll cleanup: unlinkat /tmp/TestLabelNames_Cancelled3562416338/001: directory not empty
level=info msg="created in-memory index cache" maxItemSizeBytes=134217728 maxSizeBytes=1073741824 maxItems=maxInt
level=info msg="ring doesn't exist in KV store yet"
level=info msg="instance not found in the ring" instance=test ring=store-gateway
level=info msg="not loading tokens from file, tokens file path is empty"
level=info msg="waiting until store-gateway is JOINING in the ring"
level=info msg="store-gateway is JOINING in the ring"
level=info msg="synchronizing TSDB blocks for all users"
level=warn msg="failed to synchronize TSDB blocks" err="assert.AnError general error for testing"
level=info msg="ring lifecycler is shutting down" ring=store-gateway
level=info msg="unregistering instance from ring" ring=store-gateway
level=info msg="instance removed from the ring" ring=store-gateway
FAIL
FAIL    github.com/grafana/mimir/pkg/storegateway   374.966s

narqo commented 2 weeks ago

testing.go:1231: TempDir RemoveAll cleanup: unlinkat /tmp/TestLabelNames_Cancelled3562416338/001: directory not empty

I'll take a closer look later, but at first glance there may be a race between the cleanup in testing.T.TempDir and the internals of BucketStore.RemoveBlocksAndClose. The latter doesn't wait for the goroutine inside snapshotter (and indexReaderPool) to actually stop, so that goroutine can still write a lazy-loaded index while the test is cleaning out the bucket's directory.