thanos-io / promql-engine

Multi-threaded PromQL engine implementation based on the Volcano paper.
Apache License 2.0
135 stars 54 forks source link

stats: Implement TotalSamplesPerStep #466

Closed harry671003 closed 2 days ago

harry671003 commented 2 weeks ago

This is an attempt at implementing the samples per step stats

 Changes

Note

harry671003 commented 6 days ago

Requesting reviews. @fpetkovski @MichaHoffmann could you please take a look?

harry671003 commented 4 days ago

Just one question from my side, what is the difference between IncrementSamplesAtStep(samples int, step int) and IncrementSamplesAtTimestamp(samples int, t int64)? Do we still need both methods?

@fpetkovski - Thanks for asking this question. It prompted me to look deeper and uncover a bug.

IncrementSamplesAtTimestamp should be used when we don't know the correct step index. This is always the case in the new engine because of stepsBatch. See: https://github.com/prometheus/prometheus/blob/71c90c71d41948eea3f33dfd3ee0a5cc581a36f3/util/stats/query_stats.go#L290


We should always use IncrementSamplesAtTimestamp() in the new engine. In the new engine, we only know the timestamp and not the step index. This is because of the batching using stepsBatch.

This is what would happen if IncrementSamplesAtStep() is used in the VectorSelector. In the below example range query, there are a total of 61 steps which are executed in 7 batches of 10 steps.

{
  name: "vector selector",
  load: `load 30s
   http_requests_total{pod="nginx-1"} 1+1x1000
   http_requests_total{pod="nginx-2"} 1+2x1000`,
  query: `http_requests_total{pod="nginx-1"}`,
  start: time.Unix(0, 0),
  end:   time.Unix(1800, 0),
  step:  time.Second * 30,
},
// Old engine
exp: []int64{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}

// New engine            
got: []int64{7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}

Since at the start of each batch, the currStep is reset to 0, the samples will always be counted towards the first 10 steps (assuming stepsBatch=10).

When IncrementSamplesAtTimestamp() is used instead, it will correctly increment the sample for the timestamp of the step irrespective of the batch.


I will remove IncrementSamplesAtStep() from the code base and always use IncrementSamplesAtTimestamp(). I will also extend the tests to cover multiple batches.

harry671003 commented 3 days ago

The fuzz failure doesn't seem related to this change.