grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Race observed in `TestMultiDimensionalQueueAlgorithmSlowConsumerEffects` test. #9253

Open pstibrany opened 1 week ago

pstibrany commented 1 week ago

Race observed in TestMultiDimensionalQueueAlgorithmSlowConsumerEffects test.

```
==================
WARNING: DATA RACE
Write at 0x00c0003ba0b8 by goroutine 2879:
  github.com/grafana/mimir/pkg/scheduler/queue.(*tenantQuerierAssignments).setup()
      /__w/mimir/mimir/pkg/scheduler/queue/tenant_querier_assignment.go:423 +0xc4
  github.com/grafana/mimir/pkg/scheduler/queue.(*MultiQueuingAlgorithmTreeQueue).Dequeue()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue.go:85 +0x115
  github.com/grafana/mimir/pkg/scheduler/queue.TestMultiDimensionalQueueAlgorithmSlowConsumerEffects.func1()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue_benchmark_test.go:485 +0xc9b
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Previous read at 0x00c0003ba0b8 by goroutine 2880:
  github.com/grafana/mimir/pkg/scheduler/queue.(*queueBroker).dequeueRequestForQuerier()
      /__w/mimir/mimir/pkg/scheduler/queue/tenant_queues.go:152 +0x146
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).trySendNextRequestForQuerier()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:422 +0x76
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).dispatcherLoop()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:340 +0x364
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).starting.gowrap1()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:289 +0x33

Goroutine 2879 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1743 +0x825
  github.com/grafana/mimir/pkg/scheduler/queue.TestMultiDimensionalQueueAlgorithmSlowConsumerEffects()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue_benchmark_test.go:422 +0x1e04
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 2880 (running) created at:
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).starting()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:289 +0x8d
  github.com/grafana/mimir/pkg/scheduler/queue.TestMultiDimensionalQueueAlgorithmSlowConsumerEffects.func1()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue_benchmark_test.go:442 +0x486
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44
==================
```

dimitarvdimitrov commented 2 days ago

Another occurrence:


```
==================
WARNING: DATA RACE
Write at 0x00c000539ad8 by goroutine 4103:
  github.com/grafana/mimir/pkg/scheduler/queue.(*tenantQuerierAssignments).setup()
      /__w/mimir/mimir/pkg/scheduler/queue/tenant_querier_assignment.go:237 +0xc4
  github.com/grafana/mimir/pkg/scheduler/queue.(*MultiQueuingAlgorithmTreeQueue).Dequeue()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue.go:85 +0x115
  github.com/grafana/mimir/pkg/scheduler/queue.TestMultiDimensionalQueueAlgorithmSlowConsumerEffects.func1()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue_benchmark_test.go:485 +0xc9b
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Previous read at 0x00c000539ad8 by goroutine 4104:
  github.com/grafana/mimir/pkg/scheduler/queue.(*queueBroker).dequeueRequestForQuerier()
      /__w/mimir/mimir/pkg/scheduler/queue/tenant_queues.go:156 +0x153
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).trySendNextRequestForQuerier()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:392 +0x76
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).dispatcherLoop()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:310 +0x364
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).starting.gowrap1()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:259 +0x33

Goroutine 4103 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1743 +0x825
  github.com/grafana/mimir/pkg/scheduler/queue.TestMultiDimensionalQueueAlgorithmSlowConsumerEffects()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue_benchmark_test.go:422 +0x1d84
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 4104 (running) created at:
  github.com/grafana/mimir/pkg/scheduler/queue.(*RequestQueue).starting()
      /__w/mimir/mimir/pkg/scheduler/queue/queue.go:259 +0x8d
  github.com/grafana/mimir/pkg/scheduler/queue.TestMultiDimensionalQueueAlgorithmSlowConsumerEffects.func1()
      /__w/mimir/mimir/pkg/scheduler/queue/multi_queuing_algorithm_tree_queue_benchmark_test.go:442 +0x486
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44
==================
--- FAIL: TestMultiDimensionalQueueAlgorithmSlowConsumerEffects (15.47s)
    --- FAIL: TestMultiDimensionalQueueAlgorithmSlowConsumerEffects/tree:_worker-queue_prioritization_->_tenant-querier_tree,_2_tenants,_first_with_25pct_slow_queries,_second_with_75pct_slow_queries (0.67s)
        multi_queuing_algorithm_tree_queue_benchmark_test.go:478: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue: [ingester: mean: 0.0290 stddev: 0.01 store-gateway: mean: 0.2595 stddev: 0.17]
        multi_queuing_algorithm_tree_queue_benchmark_test.go:479: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue:[tenant-0: mean: 0.0329 stddev: 0.01 tenant-1: mean: 0.0169 stddev: 0.00]
        testing.go:1399: race detected during execution of test
    multi_queuing_algorithm_tree_queue_benchmark_test.go:492: Results by query component:
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 1 tenant, 10pct slow queries: seconds in queue: [ingester: mean: 0.1128 stddev: 0.02 store-gateway: mean: 0.0282 stddev: 0.04]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 1 tenant, 10pct slow queries: seconds in queue: [ingester: mean: 0.1191 stddev: 0.02 store-gateway: mean: 0.0270 stddev: 0.04]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 10pct slow queries: seconds in queue: [ingester: mean: 0.0283 stddev: 0.01 store-gateway: mean: 0.0323 stddev: 0.03]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 1 tenant, 25pct slow queries: seconds in queue: [ingester: mean: 0.2079 stddev: 0.05 store-gateway: mean: 0.0963 stddev: 0.08]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 1 tenant, 25pct slow queries: seconds in queue: [ingester: mean: 0.2359 stddev: 0.07 store-gateway: mean: 0.1067 stddev: 0.08]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 25pct slow queries: seconds in queue: [ingester: mean: 0.0247 stddev: 0.01 store-gateway: mean: 0.1124 stddev: 0.08]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 1 tenant, 50pct slow queries: seconds in queue: [ingester: mean: 0.3968 stddev: 0.17 store-gateway: mean: 0.2492 stddev: 0.16]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 1 tenant, 50pct slow queries: seconds in queue: [ingester: mean: 0.3963 stddev: 0.18 store-gateway: mean: 0.2500 stddev: 0.16]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 50pct slow queries: seconds in queue: [ingester: mean: 0.0200 stddev: 0.00 store-gateway: mean: 0.2642 stddev: 0.17]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 1 tenant, 75pct slow queries: seconds in queue: [ingester: mean: 0.2599 stddev: 0.16 store-gateway: mean: 0.4161 stddev: 0.25]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 1 tenant, 75pct slow queries: seconds in queue: [ingester: mean: 0.2574 stddev: 0.17 store-gateway: mean: 0.4107 stddev: 0.26]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 75pct slow queries: seconds in queue: [ingester: mean: 0.0139 stddev: 0.00 store-gateway: mean: 0.4102 stddev: 0.25]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 1 tenant, 90pct slow queries: seconds in queue: [ingester: mean: 0.1076 stddev: 0.08 store-gateway: mean: 0.4847 stddev: 0.30]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 1 tenant, 90pct slow queries: seconds in queue: [ingester: mean: 0.0822 stddev: 0.07 store-gateway: mean: 0.5012 stddev: 0.31]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 90pct slow queries: seconds in queue: [ingester: mean: 0.0103 stddev: 0.00 store-gateway: mean: 0.4971 stddev: 0.31]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 2 tenants, first with 10pct slow queries, second with 90pct slow queries: seconds in queue: [ingester: mean: 0.1291 stddev: 0.04 store-gateway: mean: 0.2398 stddev: 0.16]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 2 tenants, first with 10pct slow queries, second with 90pct slow queries: seconds in queue: [ingester: mean: 0.1437 stddev: 0.04 store-gateway: mean: 0.2519 stddev: 0.16]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 10pct slow queries, second with 90pct slow queries: seconds in queue: [ingester: mean: 0.0200 stddev: 0.00 store-gateway: mean: 0.2717 stddev: 0.18]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue: [ingester: mean: 0.2099 stddev: 0.08 store-gateway: mean: 0.2400 stddev: 0.16]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue: [ingester: mean: 0.2298 stddev: 0.09 store-gateway: mean: 0.2440 stddev: 0.16]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue: [ingester: mean: 0.0290 stddev: 0.01 store-gateway: mean: 0.2595 stddev: 0.17]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: tenant-querier -> query component round-robin tree, 2 tenants, first with 50pct slow queries, second with 50pct slow queries: seconds in queue: [ingester: mean: 0.3658 stddev: 0.17 store-gateway: mean: 0.2202 stddev: 0.14]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: query component round-robin -> tenant-querier tree, 2 tenants, first with 50pct slow queries, second with 50pct slow queries: seconds in queue: [ingester: mean: 0.3548 stddev: 0.15 store-gateway: mean: 0.2205 stddev: 0.14]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:494: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 50pct slow queries, second with 50pct slow queries: seconds in queue: [ingester: mean: 0.0257 stddev: 0.00 store-gateway: mean: 0.2188 stddev: 0.14]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:497: Results for ingester-only queries by tenant ID:
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 1 tenant, 10pct slow queries: seconds in queue:[tenant-0: mean: 0.1128 stddev: 0.02]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 1 tenant, 10pct slow queries: seconds in queue:[tenant-0: mean: 0.1191 stddev: 0.02]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 10pct slow queries: seconds in queue:[tenant-0: mean: 0.0283 stddev: 0.01]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 1 tenant, 25pct slow queries: seconds in queue:[tenant-0: mean: 0.2079 stddev: 0.05]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 1 tenant, 25pct slow queries: seconds in queue:[tenant-0: mean: 0.2359 stddev: 0.07]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 25pct slow queries: seconds in queue:[tenant-0: mean: 0.0247 stddev: 0.01]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 1 tenant, 50pct slow queries: seconds in queue:[tenant-0: mean: 0.3968 stddev: 0.17]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 1 tenant, 50pct slow queries: seconds in queue:[tenant-0: mean: 0.3963 stddev: 0.18]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 50pct slow queries: seconds in queue:[tenant-0: mean: 0.0200 stddev: 0.00]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 1 tenant, 75pct slow queries: seconds in queue:[tenant-0: mean: 0.2599 stddev: 0.16]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 1 tenant, 75pct slow queries: seconds in queue:[tenant-0: mean: 0.2574 stddev: 0.17]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 75pct slow queries: seconds in queue:[tenant-0: mean: 0.0139 stddev: 0.00]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 1 tenant, 90pct slow queries: seconds in queue:[tenant-0: mean: 0.1076 stddev: 0.08]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 1 tenant, 90pct slow queries: seconds in queue:[tenant-0: mean: 0.0822 stddev: 0.07]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 1 tenant, 90pct slow queries: seconds in queue:[tenant-0: mean: 0.0103 stddev: 0.00]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 2 tenants, first with 10pct slow queries, second with 90pct slow queries: seconds in queue:[tenant-0: mean: 0.1380 stddev: 0.02 tenant-1: mean: 0.0652 stddev: 0.05]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 2 tenants, first with 10pct slow queries, second with 90pct slow queries: seconds in queue:[tenant-0: mean: 0.1456 stddev: 0.04 tenant-1: mean: 0.1286 stddev: 0.08]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 10pct slow queries, second with 90pct slow queries: seconds in queue:[tenant-0: mean: 0.0209 stddev: 0.00 tenant-1: mean: 0.0119 stddev: 0.00]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue:[tenant-0: mean: 0.2139 stddev: 0.07 tenant-1: mean: 0.1968 stddev: 0.10]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue:[tenant-0: mean: 0.2294 stddev: 0.07 tenant-1: mean: 0.2309 stddev: 0.13]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 25pct slow queries, second with 75pct slow queries: seconds in queue:[tenant-0: mean: 0.0329 stddev: 0.01 tenant-1: mean: 0.0169 stddev: 0.00]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: tenant-querier -> query component round-robin tree, 2 tenants, first with 50pct slow queries, second with 50pct slow queries: seconds in queue:[tenant-0: mean: 0.3527 stddev: 0.17 tenant-1: mean: 0.3802 stddev: 0.17]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: query component round-robin -> tenant-querier tree, 2 tenants, first with 50pct slow queries, second with 50pct slow queries: seconds in queue:[tenant-0: mean: 0.3288 stddev: 0.16 tenant-1: mean: 0.3794 stddev: 0.14]
    multi_queuing_algorithm_tree_queue_benchmark_test.go:499: tree: worker-queue prioritization -> tenant-querier tree, 2 tenants, first with 50pct slow queries, second with 50pct slow queries: seconds in queue:[tenant-0: mean: 0.0261 stddev: 0.00 tenant-1: mean: 0.0252 stddev: 0.00]
FAIL
```

@chencs @francoposa does anything come to mind?