Open smutukula opened 2 years ago
We have faced the same issue when running the test for Sysdig. This can be observed for queries with negative offsets of -5m and -10m. As you have said, we are querying 3m into the future for the first query (with offset -5m), and 8m for the second query (with offset -10m).
I also believe that this test should focus on evaluating query compliance and should be decoupled from ingestion speed in this case, and generally from the ingestion process as a whole.
I see three approaches to mitigating this:

1. Change the default end of the query interval from `now() - 2m` to `now() - 11m` (one minute more than the largest negative offset). The consequence of this approach is that when running the test, we have to wait at least 21m for data to be ingested into both the reference Prometheus and the vendor implementation under comparison (to compensate for the queries with -10m and 10m offsets).
2. Set the `.query_time_parameters.end_time` parameter to a timestamp calculated as `now() - 11m`. But to be honest, I don't like specifying the exact timestamp of the query interval end (this can be scripted, of course), since the test isn't reproducible with the same config file. I may be spawning a new reference Prometheus and vendor platform for each test, so I cannot reuse the same config file, e.g. from the last successful run that happened yesterday.
3. Introduce a new parameter `.query_time_parameters.end_time_offset` that, if set, overrides the default 2m offset for the end of the query interval. This parameter can be used only if the `.query_time_parameters.end_time` parameter is not set. If we set this parameter to 11m, we no longer face problems with the offset -10m query (since we get the same results as in case no. 1). But since the offset -10m query is here to stay, in my case this parameter would always be set to 11m.

I believe the first one is the least intrusive approach (here is the PR for it), but I would like to get opinions on the other proposals as well.
Thanks! Yeah, I think your PR is the right approach as a short-term workaround.
Long-term it would make sense to rethink how we create, ingest, and query the test data set in general. We'll still want to have some amount of randomness in the data set to avoid overfitting / missing weird corner cases (or to accommodate folks like Sysdig who have TSDB sample alignment limitations), but it would be good to get away from the requirement to actually let ingestion run in real time for a couple of hours before being able to run tests. Prometheus has had its own remote-write receiver handler for a while now, so that could be used to push all data in big batches.
Hi
PromQL conformance tests with negative offsets fail because of latency issues.
Ingestion latencies may differ between native Prometheus storage and a vendor's storage (due to persistence, geo-redundancy, etc.). Negative-offset queries try to fetch data at future timestamps, so the availability of that data makes a difference in the query results. In effect, the compliance tool is not testing the correctness of the queries but the latency of the storage backends.
Suppose a query with negative offset -5m is made at 10:32:00.000. According to the compliance tool, the timestamps for the query are:

end_time: time.Now() - 2m => 10:32:00.000 - 2m => 10:30:00.000
start_time: end_time - 10m => 10:30:00.000 - 10m => 10:20:00.000

The query needs values up to 10:35:00 (because of the -5m offset from the end time).

At 10:32:00.000, due to latency, a value may not yet be available for timestamp 10:31:55.000 in the vendor's storage while it is available in native Prometheus (this can happen vice versa as well). In that case the value returned by native Prometheus will be 101 whereas the vendor implementation returns 100 (the previous available value).
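To make the timing arithmetic concrete, here is a small Python sketch (illustrative only; the timestamps mirror the example above, and the 2m/10m values are the tool's defaults as described in this thread):

```python
from datetime import datetime, timedelta

# Example "now" from the scenario above: 10:32:00.
now = datetime(2023, 1, 1, 10, 32, 0)

# Compliance-tool defaults discussed in this thread: the query
# interval ends 2m before now and spans 10m.
end_time = now - timedelta(minutes=2)          # 10:30:00
start_time = end_time - timedelta(minutes=10)  # 10:20:00

# A query with "offset -5m" evaluated at end_time reads the sample
# 5m *after* the evaluation timestamp.
furthest_sample = end_time + timedelta(minutes=5)  # 10:35:00

# Relative to "now", the furthest sample needed lies in the future,
# so it may not have been ingested everywhere yet.
print(furthest_sample - now)  # 0:03:00
```

This shows why the -5m offset query effectively depends on data 3m ahead of wall-clock time, turning the test into a race against ingestion latency.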
I think we can overcome this by providing a "query_time_parameters" end_time value that is a timestamp more than 10 minutes in the past. But I would like to hear other thoughts on this issue and how to overcome it.