prometheus / compliance

A set of tests to check compliance with various Prometheus interfaces
Apache License 2.0

Conformance test queries with negative offset fail due to latency #94

Open smutukula opened 2 years ago

smutukula commented 2 years ago

Hi

PromQL conformance tests with negative offset fail because of latency issues.

Ingestion latency can differ between native Prometheus and a vendor's storage (due to persistence, geo-redundancy, etc.). Negative offset queries fetch data for a future timestamp, so whether a sample has already been ingested changes the result of the query. In effect, the compliance tool ends up testing the latency of the storage backends rather than the correctness of the queries.


Suppose a query with negative offset -5m is made at 10:32:00.000. According to the compliance tool, the query timestamps are:

end_time: time.Now() - 2m => 10:32:00.000 - 2m => 10:30:00.000
start_time: end_time - 10m => 10:30:00.000 - 10m => 10:20:00.000

The query needs values up to 10:35:00.000 (because of the -5m offset from the end time).

At 10:32:00.000, due to latency, the sample for timestamp 10:31:55.000 may not yet be available in the vendor's storage while it is already available in native Prometheus (this can also happen the other way around). In that case, native Prometheus returns 101 while the vendor implementation returns 100 (the previous available value).
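To make the arithmetic concrete, here is a minimal Go sketch of the query window described above (values assume the default 2m end offset and 10m range; this is illustrative, not the tool's actual code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Reconstructing the tester's default query window: end = now - 2m,
	// start = end - 10m.
	now := time.Now()
	end := now.Add(-2 * time.Minute)
	start := end.Add(-10 * time.Minute)

	// A query like `some_metric offset -5m` evaluated at `end` reads the
	// sample at end + 5m, i.e. 3 minutes ahead of the wall clock.
	latestNeeded := end.Add(5 * time.Minute)

	fmt.Println("start:              ", start.Format(time.RFC3339))
	fmt.Println("end:                ", end.Format(time.RFC3339))
	fmt.Println("latest sample read: ", latestNeeded.Format(time.RFC3339))
	fmt.Println("reaches into future:", latestNeeded.Sub(now))
	// No sample exists yet at `latestNeeded`, so PromQL's lookback falls
	// back to the most recently ingested sample; whichever backend has
	// ingested more recently returns a different (newer) value.
}
```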

I think we can overcome this by setting the "query_time_parameters" end_time value to a timestamp that is at least 10 minutes before the current time. But I would like to hear other thoughts on this issue and how to overcome it.
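For illustration, a minimal sketch of computing such a timestamp (the exact format the tool expects for end_time isn't shown here, so RFC 3339 is an assumption):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Choose an end_time far enough in the past that even the largest
	// negative offset in the test suite (-10m) stays behind "now".
	maxNegOffset := 10 * time.Minute
	endTime := time.Now().Add(-(maxNegOffset + time.Minute)).UTC()
	// endTime + 10m is still about a minute in the past, so both backends
	// have had time to ingest every sample such a query can touch.
	fmt.Println(endTime.Format(time.RFC3339))
}
```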

nikola-milikic commented 2 years ago

We have faced the same issue when running the test for Sysdig. This can be observed for queries with negative offsets of -5m and -10m. As you have said, we are querying 3m into the future for the first query (with offset -5m) and 8m into the future for the second (with offset -10m).

I also believe that this test should be focused on evaluating query compliance and should be decoupled from the ingestion speed in this case, and generally from the ingestion process as a whole.

I see three approaches to mitigating this:

  1. Increase the default query end time from now() - 2m to now() - 11m (one minute more than the largest negative offset). The consequence of this approach is that when running the test, we should wait at least 21m for data to get ingested into both the reference Prometheus and the vendor implementation under comparison (to compensate for the queries with -10m and 10m offsets).
  2. On each test run, set the .query_time_parameters.end_time parameter to a timestamp calculated as now() - 11m. But to be honest, I don't like specifying the exact timestamp of the query interval end (this can be scripted, of course) since the test isn't reproducible with the same config file. For example, I may be spawning a new reference Prometheus and vendor platform for each test, so I cannot reuse the same config file from the last successful run that happened yesterday.
  3. Introduce a new config parameter .query_time_parameters.end_time_offset that, if set, will override the default 2m offset for the end of the query interval (see the sketch after this list). This parameter can be used only if the .query_time_parameters.end_time parameter is not set. If we set this parameter to 11m, then we will not face the problems with the offset -10m query (since we get the same results as in case no. 1). But since the offset -10m query is here to stay, in my case this parameter would always be set to 11m.
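A rough sketch of how option 3 could resolve the end time; the struct and field names are illustrative stand-ins, not the tester's actual config types:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative stand-ins only; the real config structs and field names in
// promql-compliance-tester may differ.
type QueryTimeParameters struct {
	EndTime       string        // .query_time_parameters.end_time (absolute timestamp)
	EndTimeOffset time.Duration // proposed .query_time_parameters.end_time_offset
}

const defaultEndTimeOffset = 2 * time.Minute

// resolveEndTime applies the precedence described in option 3: an explicit
// end_time wins, end_time_offset overrides the default 2m offset, and
// setting both is rejected.
func resolveEndTime(p QueryTimeParameters, now time.Time) (time.Time, error) {
	if p.EndTime != "" {
		if p.EndTimeOffset != 0 {
			return time.Time{}, errors.New("end_time and end_time_offset are mutually exclusive")
		}
		return time.Parse(time.RFC3339, p.EndTime)
	}
	offset := defaultEndTimeOffset
	if p.EndTimeOffset != 0 {
		offset = p.EndTimeOffset // e.g. 11m to keep -10m offset queries in the past
	}
	return now.Add(-offset), nil
}

func main() {
	end, err := resolveEndTime(QueryTimeParameters{EndTimeOffset: 11 * time.Minute}, time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Println("query end time:", end.Format(time.RFC3339))
}
```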

I believe the first one is the least intrusive approach (here is the PR for it), but I would like to get opinions on other proposals as well.

juliusv commented 2 years ago

Thanks! Yeah, I think your PR is the right approach as a short-term workaround.

Long-term it would make sense to rethink how we create, ingest, and query the test data set in general. We'll still want some amount of randomness in the data set to avoid overfitting / missing weird corner cases (or to accommodate folks like Sysdig who have TSDB sample alignment limitations), but it would be good to get away from the requirement to let ingestion run in real time for a couple of hours before being able to run tests. Prometheus has had its own remote write receiver handler for a while now, so that could be used to push all the data in big batches.
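Not the compliance suite's code, just a rough sketch of what pushing a pre-generated batch through the remote-write receiver could look like. It assumes Prometheus is running locally with the receiver enabled (depending on version, --web.enable-remote-write-receiver or --enable-feature=remote-write-receiver), and the metric name and labels are made up:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// Build one series with an hour's worth of backfilled samples,
	// oldest first (remote write expects increasing timestamps).
	now := time.Now()
	var samples []prompb.Sample
	for i := 0; i < 60; i++ {
		ts := now.Add(-time.Duration(60-i) * time.Minute)
		samples = append(samples, prompb.Sample{
			Value:     float64(i),
			Timestamp: ts.UnixMilli(),
		})
	}

	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "compliance_demo_metric"},
				{Name: "job", Value: "compliance-test"},
			},
			Samples: samples,
		}},
	}

	// Remote write payloads are snappy-compressed protobuf.
	raw, err := proto.Marshal(req)
	if err != nil {
		panic(err)
	}
	compressed := snappy.Encode(nil, raw)

	httpReq, err := http.NewRequest(http.MethodPost,
		"http://localhost:9090/api/v1/write", bytes.NewReader(compressed))
	if err != nil {
		panic(err)
	}
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("remote write response:", resp.Status)
}
```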