grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0
3.87k stars 476 forks source link

Investigation into query scheduler dimensions #8571

Open Logiraptor opened 2 weeks ago

Logiraptor commented 2 weeks ago

This issue is meant to aid in the investigation of https://github.com/grafana/mimir-squad/issues/1945. I'll reference code from this branch throughout: https://github.com/grafana/mimir/tree/logiraptor/prioritized-queues

Reproducing the problem

  1. Add flags to artificially slow down ingesters + store-gateways
  2. Add a k6 test and mimir helm config which can reproduce the issue

The basic idea is this:

  1. We run mimir with a single querier and concurrency = 4
  2. This means for each second of wall time, we have 4s of capacity for queries
  3. We set ingesters to 1s slowdown
  4. We set store-gateways to 5s slowdown
  5. Then we run 1 ingester query and 1 store-gateway query per second via k6
  6. This is 6s of load per 1s of wall time. we do not have enough capacity to service this load

Then I tested three versions of mimir with this setup:

Control: Here are the results with main

Full K6 Results

``` /\ |‾‾| /‾‾/ /‾‾/ /\ / \ | |/ / / / / \/ \ | ( / ‾‾\ / \ | |\ \ | (‾) | / __________ \ |__| \__\ \_____/ .io execution: local script: load-testing-with-k6.js output: - scenarios: (100.00%) 3 scenarios, 201 max VUs, 10m30s max duration (incl. graceful stop): * ingester_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_ingester_query, gracefulStop: 30s) * store_gateway_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_storegateway_query, gracefulStop: 30s) * writing_metrics: Up to 0.05 iterations/s for 10m0s over 3 stages (maxVUs: 1, exec: write, gracefulStop: 30s) ✓ write worked █ ingester ✗ expected request status to equal 200 ↳ 85% — ✓ 349 / ✗ 57 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' █ store-gateway ✗ expected request status to equal 200 ↳ 49% — ✓ 198 / ✗ 206 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' checks........................................: 89.39% ✓ 2217 ✗ 263 ✓ { type:ingester }...........................: 0.00% ✓ 0 ✗ 0 ✓ { type:store-gateway }......................: 0.00% ✓ 0 ✗ 0 ✓ { type:write }..............................: 100.00% ✓ 29 ✗ 0 data_received.................................: 180 kB 285 B/s data_sent.....................................: 833 kB 1.3 kB/s dropped_iterations............................: 241 0.38253/s group_duration................................: avg=1m33s min=1.02s med=1m56s max=2m0s p(90)=2m0s p(95)=2m0s http_req_blocked..............................: avg=286.87µs min=1µs med=5µs max=41.66ms p(90)=553.2µs p(95)=677.2µs http_req_connecting...........................: avg=249µs min=0s med=0s max=41.61ms p(90)=496.2µs p(95)=598.19µs http_req_duration.............................: avg=1m29s min=7.18ms med=1m56s max=2m0s p(90)=2m0s p(95)=2m0s { expected_response:true }..................: avg=1m16s min=7.18ms med=1m18s max=1m59s p(90)=1m58s p(95)=1m58s ✗ { type:ingester }...........................: avg=1m31s min=1.02s med=1m56s max=2m0s p(90)=2m0s p(95)=2m0s ✗ { type:store-gateway }......................: avg=1m34s min=5.02s med=2m0s max=2m0s p(90)=2m0s p(95)=2m0s ✓ { url:http://localhost:8080/api/v1/push }...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_failed...............................: 31.34% ✓ 263 ✗ 576 http_req_receiving............................: avg=46.57µs min=0s med=51µs max=830µs p(90)=79µs p(95)=100µs http_req_sending..............................: avg=37.95µs min=5µs med=24µs max=760µs p(90)=66µs p(95)=79.09µs http_req_tls_handshaking......................: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_waiting..............................: avg=1m29s min=7.07ms med=1m56s max=2m0s p(90)=2m0s p(95)=2m0s http_reqs.....................................: 839 1.331714/s iteration_duration............................: avg=1m29s min=8.11ms med=1m56s max=2m0s p(90)=2m0s p(95)=2m0s iterations....................................: 839 1.331714/s vus...........................................: 150 min=2 max=200 vus_max.......................................: 201 min=21 max=201 running (10m30.0s), 000/201 VUs, 839 complete and 150 interrupted iterations ingester_queries ✓ [======================================] 076/100 VUs 10m0s 1.00 iters/s store_gateway_queries ✓ [======================================] 074/100 VUs 10m0s 1.00 iters/s writing_metrics ✓ [======================================] 0/1 VUs 10m0s 0.05 iters/s ERRO[0631] thresholds on metrics 'http_req_duration{type:ingester}, http_req_duration{type:store-gateway}' have been crossed ```

     █ ingester
       ✗ expected request status to equal 200
        ↳  85% — ✓ 349 / ✗ 57

     █ store-gateway

       ✗ expected request status to equal 200
        ↳  49% — ✓ 198 / ✗ 206

     ✗ { type:ingester }...........................: avg=1m58s    min=1s     med=1m59s  max=2m0s   p(90)=1m59s   p(95)=1m59s  
     ✗ { type:store-gateway }......................: avg=47.53s   min=5.01s  med=46.66s max=1m28s  p(90)=1m22s   p(95)=1m26s  

Here we see both ingesters and store-gateways have high p95 latency and many requests failed.

Test 2: Additional Queue Dimensions

For this one I just enabled the flag for additional queue dimensions.

Full K6 Results

``` /\ |‾‾| /‾‾/ /‾‾/ /\ / \ | |/ / / / / \/ \ | ( / ‾‾\ / \ | |\ \ | (‾) | / __________ \ |__| \__\ \_____/ .io execution: local script: load-testing-with-k6.js output: - scenarios: (100.00%) 3 scenarios, 201 max VUs, 10m30s max duration (incl. graceful stop): * ingester_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_ingester_query, gracefulStop: 30s) * store_gateway_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_storegateway_query, gracefulStop: 30s) * writing_metrics: Up to 0.05 iterations/s for 10m0s over 3 stages (maxVUs: 1, exec: write, gracefulStop: 30s) ✓ write worked █ ingester ✗ expected request status to equal 200 ↳ 3% — ✓ 13 / ✗ 370 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' █ store-gateway ✓ expected request status to equal 200 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' checks........................................: 84.50% ✓ 2018 ✗ 370 ✓ { type:ingester }...........................: 0.00% ✓ 0 ✗ 0 ✓ { type:store-gateway }......................: 0.00% ✓ 0 ✗ 0 ✓ { type:write }..............................: 100.00% ✓ 30 ✗ 0 data_received.................................: 164 kB 260 B/s data_sent.....................................: 861 kB 1.4 kB/s dropped_iterations............................: 209 0.331791/s group_duration................................: avg=1m18s min=1s med=1m21s max=2m0s p(90)=2m0s p(95)=2m0s http_req_blocked..............................: avg=252.43µs min=2µs med=10µs max=5.47ms p(90)=537µs p(95)=602µs http_req_connecting...........................: avg=223.3µs min=0s med=0s max=5.42ms p(90)=479.4µs p(95)=542.2µs http_req_duration.............................: avg=1m16s min=7.24ms med=1m18s max=2m0s p(90)=1m59s p(95)=1m59s { expected_response:true }..................: avg=45.66s min=7.24ms med=44.66s max=1m48s p(90)=1m23s p(95)=1m26s ✗ { type:ingester }...........................: avg=1m58s min=1s med=1m59s max=2m0s p(90)=1m59s p(95)=1m59s ✗ { type:store-gateway }......................: avg=47.53s min=5.01s med=46.66s max=1m28s p(90)=1m22s p(95)=1m26s ✓ { url:http://localhost:8080/api/v1/push }...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_failed...............................: 41.24% ✓ 370 ✗ 527 http_req_receiving............................: avg=35.93µs min=0s med=46µs max=1ms p(90)=67µs p(95)=76µs http_req_sending..............................: avg=38.81µs min=8µs med=30µs max=1.03ms p(90)=64µs p(95)=75.19µs http_req_tls_handshaking......................: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_waiting..............................: avg=1m16s min=7.15ms med=1m18s max=2m0s p(90)=1m59s p(95)=1m59s http_reqs.....................................: 897 1.424001/s iteration_duration............................: avg=1m16s min=8.1ms med=1m18s max=2m0s p(90)=2m0s p(95)=2m0s iterations....................................: 897 1.424001/s vus...........................................: 125 min=2 max=174 vus_max.......................................: 177 min=21 max=177 running (10m29.9s), 000/177 VUs, 897 complete and 125 interrupted iterations ingester_queries ✓ [======================================] 075/100 VUs 10m0s 1.00 iters/s store_gateway_queries ✓ [======================================] 050/076 VUs 10m0s 1.00 iters/s writing_metrics ✓ [======================================] 0/1 VUs 10m0s 0.05 iters/s ERRO[0631] thresholds on metrics 'http_req_duration{type:ingester}, http_req_duration{type:store-gateway}' have been crossed ```

     █ ingester

       ✗ expected request status to equal 200
        ↳  3% — ✓ 13 / ✗ 370

     █ store-gateway

       ✓ expected request status to equal 200

     ✗ { type:ingester }...........................: avg=1m58s    min=1s     med=1m59s  max=2m0s   p(90)=1m59s   p(95)=1m59s  
     ✗ { type:store-gateway }......................: avg=47.53s   min=5.01s  med=46.66s max=1m28s  p(90)=1m22s   p(95)=1m26s  

In this test, requests were still slow, but surprisingly store-gateway requests did not fail, and ingester requests failed much more often.

Test 3: Round robin b/w components

For this test, I added a simple round-robin b/w query components, trying really hard not to change anything in the existing code. diff. The basic idea is to extract an interface from queueBroker and then write a new implementation which wraps the existing one.

Full K6 Results

``` /\ |‾‾| /‾‾/ /‾‾/ /\ / \ | |/ / / / / \/ \ | ( / ‾‾\ / \ | |\ \ | (‾) | / __________ \ |__| \__\ \_____/ .io execution: local script: load-testing-with-k6.js output: - scenarios: (100.00%) 3 scenarios, 201 max VUs, 10m30s max duration (incl. graceful stop): * ingester_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_ingester_query, gracefulStop: 30s) * store_gateway_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_storegateway_query, gracefulStop: 30s) * writing_metrics: Up to 0.05 iterations/s for 10m0s over 3 stages (maxVUs: 1, exec: write, gracefulStop: 30s) ✓ write worked █ ingester ✗ expected request status to equal 200 ↳ 2% — ✓ 8 / ✗ 375 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' █ store-gateway ✗ expected request status to equal 200 ↳ 99% — ✓ 496 / ✗ 1 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' checks........................................: 84.46% ✓ 2045 ✗ 376 ✓ { type:ingester }...........................: 0.00% ✓ 0 ✗ 0 ✓ { type:store-gateway }......................: 0.00% ✓ 0 ✗ 0 ✓ { type:write }..............................: 100.00% ✓ 29 ✗ 0 data_received.................................: 166 kB 263 B/s data_sent.....................................: 843 kB 1.3 kB/s dropped_iterations............................: 204 0.323806/s group_duration................................: avg=1m16s min=1.02s med=1m10s max=2m0s p(90)=2m0s p(95)=2m0s http_req_blocked..............................: avg=287.11µs min=1µs med=8µs max=18.01ms p(90)=542.4µs p(95)=711.59µs http_req_connecting...........................: avg=256.06µs min=0s med=0s max=17.47ms p(90)=472.6µs p(95)=615.19µs http_req_duration.............................: avg=1m14s min=7.89ms med=1m8s max=2m0s p(90)=2m0s p(95)=2m0s { expected_response:true }..................: avg=41.83s min=7.89ms med=42.3s max=1m20s p(90)=1m11s p(95)=1m15s ✗ { type:ingester }...........................: avg=1m57s min=1.02s med=2m0s max=2m0s p(90)=2m0s p(95)=2m0s ✗ { type:store-gateway }......................: avg=45.06s min=10.03s med=44.5s max=2m0s p(90)=1m12s p(95)=1m16s ✓ { url:http://localhost:8080/api/v1/push }...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_failed...............................: 41.36% ✓ 376 ✗ 533 http_req_receiving............................: avg=38.31µs min=0s med=43µs max=511µs p(90)=81.2µs p(95)=101µs http_req_sending..............................: avg=39.23µs min=7µs med=28µs max=811µs p(90)=62µs p(95)=85µs http_req_tls_handshaking......................: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_waiting..............................: avg=1m14s min=7.77ms med=1m8s max=2m0s p(90)=2m0s p(95)=2m0s http_reqs.....................................: 909 1.442843/s iteration_duration............................: avg=1m14s min=8.83ms med=1m8s max=2m0s p(90)=2m0s p(95)=2m0s iterations....................................: 909 1.442843/s vus...........................................: 118 min=2 max=168 vus_max.......................................: 171 min=21 max=171 running (10m30.0s), 000/171 VUs, 909 complete and 118 interrupted iterations ingester_queries ✓ [======================================] 074/100 VUs 10m0s 1.00 iters/s store_gateway_queries ✓ [======================================] 044/070 VUs 10m0s 1.00 iters/s writing_metrics ✓ [======================================] 0/1 VUs 10m0s 0.05 iters/s ERRO[0631] thresholds on metrics 'http_req_duration{type:ingester}, http_req_duration{type:store-gateway}' have been crossed ```

Screenshot 2024-06-28 at 5 28 16 PM

     █ ingester

       ✗ expected request status to equal 200
        ↳  2% — ✓ 8 / ✗ 375

     █ store-gateway

       ✗ expected request status to equal 200
        ↳  99% — ✓ 496 / ✗ 1

     ✗ { type:ingester }...........................: avg=1m57s    min=1.02s  med=2m0s  max=2m0s    p(90)=2m0s    p(95)=2m0s    
     ✗ { type:store-gateway }......................: avg=45.06s   min=10.03s med=44.5s max=2m0s    p(90)=1m12s   p(95)=1m16s   

This test didn't do any better than the previous tests, and from this I conclude that simple round robin b/w components is insufficient.

Test 4: Prioritized component per-worker

For this one, I implemented an idea based on my reading of the paper: 2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services.

The basic idea is to assign each querier worker a certain component to prioritize. When dequeuing a request, that worker will always first check the prioritized component before attempting to dequeue from other components. As long as each component has a worker assigned, then we can guarantee that all requests will eventually be processed, and it also guarantees that requests to one component can't be blocked behind an arbitrary length of other requests.

Heres the diff from the round-robin example: diff

Full K6 Results

``` /\ |‾‾| /‾‾/ /‾‾/ /\ / \ | |/ / / / / \/ \ | ( / ‾‾\ / \ | |\ \ | (‾) | / __________ \ |__| \__\ \_____/ .io execution: local script: load-testing-with-k6.js output: - scenarios: (100.00%) 3 scenarios, 201 max VUs, 10m30s max duration (incl. graceful stop): * ingester_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_ingester_query, gracefulStop: 30s) * store_gateway_queries: 1.00 iterations/s for 10m0s (maxVUs: 10-100, exec: run_storegateway_query, gracefulStop: 30s) * writing_metrics: Up to 0.05 iterations/s for 10m0s over 3 stages (maxVUs: 1, exec: write, gracefulStop: 30s) ✓ write worked █ ingester ✓ expected request status to equal 200 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' █ store-gateway ✗ expected request status to equal 200 ↳ 64% — ✓ 266 / ✗ 149 ✓ has valid json body ✓ expected status field to equal 'success' ✓ expected data.resultType field to equal 'vector' checks........................................: 95.78% ✓ 3382 ✗ 149 ✓ { type:ingester }...........................: 0.00% ✓ 0 ✗ 0 ✓ { type:store-gateway }......................: 0.00% ✓ 0 ✗ 0 ✓ { type:write }..............................: 100.00% ✓ 30 ✗ 0 data_received.................................: 273 kB 434 B/s data_sent.....................................: 882 kB 1.4 kB/s dropped_iterations............................: 134 0.212697/s group_duration................................: avg=47.03s min=1s med=26.25s max=2m0s p(90)=2m0s p(95)=2m0s http_req_blocked..............................: avg=132.66µs min=1µs med=4µs max=19.96ms p(90)=433.8µs p(95)=507.59µs http_req_connecting...........................: avg=112.86µs min=0s med=0s max=19.89ms p(90)=374.4µs p(95)=442.39µs http_req_duration.............................: avg=45.64s min=6.25ms med=26.08s max=2m0s p(90)=2m0s p(95)=2m0s { expected_response:true }..................: avg=32.87s min=6.25ms med=24.95s max=1m59s p(90)=1m21s p(95)=1m40s ✗ { type:ingester }...........................: avg=20.68s min=1s med=20.22s max=35.2s p(90)=33.98s p(95)=34.46s ✗ { type:store-gateway }......................: avg=1m23s min=5s med=1m34s max=2m0s p(90)=2m0s p(95)=2m0s ✓ { url:http://localhost:8080/api/v1/push }...: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_failed...............................: 14.65% ✓ 149 ✗ 868 http_req_receiving............................: avg=84.28µs min=0s med=65µs max=3.45ms p(90)=124.4µs p(95)=162.79µs http_req_sending..............................: avg=32.14µs min=4µs med=22µs max=507µs p(90)=54.4µs p(95)=77.19µs http_req_tls_handshaking......................: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s http_req_waiting..............................: avg=45.64s min=6.17ms med=26.08s max=2m0s p(90)=2m0s p(95)=2m0s http_reqs.....................................: 1017 1.614275/s iteration_duration............................: avg=45.64s min=6.97ms med=26.09s max=2m0s p(90)=2m0s p(95)=2m0s iterations....................................: 1017 1.614275/s vus...........................................: 81 min=2 max=132 vus_max.......................................: 135 min=21 max=135 running (10m30.0s), 000/135 VUs, 1017 complete and 81 interrupted iterations ingester_queries ✓ [======================================] 005/034 VUs 10m0s 1.00 iters/s store_gateway_queries ✓ [======================================] 076/100 VUs 10m0s 1.00 iters/s writing_metrics ✓ [======================================] 0/1 VUs 10m0s 0.05 iters/s ERRO[0631] thresholds on metrics 'http_req_duration{type:ingester}, http_req_duration{type:store-gateway}' have been crossed ```

Screenshot 2024-07-01 at 11 20 38 AM

     █ ingester

       ✓ expected request status to equal 200

     █ store-gateway

       ✗ expected request status to equal 200
        ↳  64% — ✓ 266 / ✗ 149

     ✗ { type:ingester }...........................: avg=20.68s   min=1s     med=20.22s max=35.2s   p(90)=33.98s  p(95)=34.46s  
     ✗ { type:store-gateway }......................: avg=1m23s    min=5s     med=1m34s  max=2m0s    p(90)=2m0s    p(95)=2m0s    

This time has a different story. p95 ingester latency is 34s (from 2m), and not a single ingester request failed. 36% of store-gateway requests failed, which we know is inevitable because there is not enough capacity in this test scenario.

francoposa commented 1 week ago

added this the tasklist of the main issue so it's more directly tracked with the other subtasks