jacksontj / promxy

An aggregating proxy to enable HA prometheus
MIT License
1.14k stars, 128 forks

Expected type range vector in call to function "delta", got instant vector #647

Closed: CH-DrewWatson closed this issue 5 months ago

CH-DrewWatson commented 6 months ago

TLDR

Seeing inconsistent results across different VictoriaMetrics targets when querying through Promxy. The results are consistent when queries are run directly against each VictoriaMetrics instance.

Description

I'm working on a migration from an old, self-hosted VictoriaMetrics to a new, managed cloud instance of VictoriaMetrics.

In Grafana, I have a datasource for each of the old, the new, and Promxy (that aggregates both old and new into one datasource).

In testing Promxy, prior to any migration happening, I have found that a lot of our existing Grafana dashboards contain queries that do not work with Promxy, all with the same error (though the function may vary).

The query below runs fine against the old datasource and the new but fails when run against the Promxy datasource.

Here is a sample failing query:

delta(sum(some_count{env="$env",job="$job"}) by (some_code))-delta(sum(some_failed{env="$env",job="$job"}) by (some_code))

And this is the error:

bad_data: 1:7: parse error: expected type range vector in call to function "delta", got instant vector

Promxy log only shows some 400s with no additional details. Not posting it here intentionally.

Adding a range and resolution to the query, as shown below, allows the query to be run against Promxy when the time range forces it to hit only the new, managed VictoriaMetrics (based on absolute_time_range in the config). If the time range includes the old instance, it fails with the error below.

Query with range and resolution:

delta(sum(some_count{env="$env",job="$job"}) by (some_code)[5m:10s])-delta(sum(some_failed{env="$env",job="$job"}) by (some_code)[5m:10s])

Error with range and resolution when hitting older VictoriaMetrics:

execution: unexpected error: runtime error: invalid memory address or nil pointer dereference

As you can see, adding range and resolution is not a viable solution since it requires a lot of manual updates to migrate and does not work with our older metrics.

Promxy as a datasource is working on many other queries, even with date ranges that span both old and new targets. Was hoping for Promxy to be the solution to a simplified and seamless migration so I'm definitely interested in anything that will make that happen.

Prior Issues

I have searched prior issues, and the two that exist were closed as complete 3 years ago, though this feels like the same issue.

Promxy Version

Initially discovered issue on v0.0.77 but have since upgraded to v0.0.85 (current) and issue persists.

jacksontj commented 5 months ago

First off; thanks for reporting!

I took a quick look into this and I'm surprised that the first query works on the old datasource. When I put that into prometheus I get an error -- because delta expects a range vector, not a single point in time. So this first error seems to match the prometheus behavior (maybe VM has some default range feature?) -- which means it's "working as intended".
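For context, here is a minimal Python sketch of why a function like delta needs a window of samples rather than a single point. It is a simplification and not Prometheus's implementation: the real delta also extrapolates toward the window boundaries, and the name simple_delta and the (timestamp, value) sample layout are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def simple_delta(samples: List[Sample]) -> Optional[float]:
    """Difference between the last and first sample in a window.

    Hypothetical helper: returns None when the window holds fewer than
    two samples, because a single point carries no change information.
    """
    if len(samples) < 2:
        return None
    first_value = samples[0][1]
    last_value = samples[-1][1]
    return last_value - first_value


# A range vector such as metric[2m] yields several samples per series.
window = [(0.0, 10.0), (60.0, 14.0), (120.0, 13.0)]
print(simple_delta(window))        # 3.0

# An instant vector yields exactly one sample per series.
print(simple_delta([(120.0, 13.0)]))  # None
```

With only one sample there is nothing to subtract, which is roughly why the parser insists on a range (or a subquery) as the argument.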

As it is now; with promxy pointing at demo.robustperception.io:9090 with the following query I am unable to reproduce the error:

delta(sum(prometheus_http_requests_total{job="prometheus"}) by (code)[5m:10s])-delta(sum(prometheus_http_requests_total{job="prometheus"}) by (code)[5m:10s])

BUT! the error you are pasting is a nil pointer error not a "query error" (meaning the query was valid; it just exploded somewhere trying to fulfill it). If you take a look either in the server logs or in the response there should actually be a big backtrace -- if you could provide that it would speed up the debugging (as of right now I'm not seeing anything obvious; but if it was I guess some test would have caught it already :D ). Definitely looking forward to the trace here; thankfully with that trace nil pointer fixes are generally easy fixes.

CH-DrewWatson commented 5 months ago

Thanks for the reply!

In parallel, I had opened a support case with VictoriaMetrics and figured I'd post their reply here...

I've made a research for range vector type error.

I found out that lookbehind window is required param for query_range requests at prometheus standard.

E.g. delta(metric[1d]) must have lookbehind window 1d. VictoriaMetrics allows skip this field and detect it automatically, based on distance between sample timestamps.

Since, promxy uses standard prometheus library for parsing expressions, it enforces those parameter for any range queries.

I'll try to get a trace log for you soon.

CH-DrewWatson commented 5 months ago

Well, I can't seem to reproduce the nil pointer error! That one may have been fixed by the upgrade to v0.0.85; I wasn't focused on this error as much as the original.

The range vector error still exists and, to be clear, the query without range and resolution runs fine against both VictoriaMetrics datasources, but fails when run against Promxy.

jacksontj commented 5 months ago

Well, I can't seem to reproduce the nil pointer error!

Well, that's at least some good news.

The range vector error still exists and, to be clear, the query without range and resolution runs fine against both VictoriaMetrics datasources, but fails when run against Promxy.

Unfortunately there is little to be done on this. I actually get the same error running this query against prometheus directly as well. Getting a little into the detail (hopefully providing some more context on the VM response): the promql spec requires that range to be defined, but VM has some automagic logic (VictoriaMetrics allows skip this field and detect it automatically, based on distance between sample timestamps.) -- which doesn't exist in the promql library promxy is based on. So until upstream prometheus supports that query -- or VM's promql library is refactored to be re-used here -- there's not a lot to do for this issue.

As much as I hate to say it -- it's "working as expected". This is a somewhat unfortunate side-effect of some of the VM "improvements", as a lot of these deviate from the spec/standard, which causes some of these edge cases.

CH-DrewWatson commented 5 months ago

Was hoping for a more favorable solution, but I get it, and thank you for your help!

jacksontj commented 5 months ago

You are welcome. It is unfortunate, but we have to draw the line somewhere. Maybe the VM guys can make a simple proxy (or library) that does all of these edge-case promql rewrites. There are a few like this where it changes the actual query transparently; it'd be nice if they could provide a library that would just rewrite the query -- then we could have that as an optional flag on promxy (since it would translate before all the promql logic).
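To make the rewrite idea concrete, here is a rough Python sketch of a pre-parse step that adds a subquery range to rollup calls that lack one. Everything in it is hypothetical: add_default_window, the function list, and the fixed 5m/10s defaults are assumptions, not anything VictoriaMetrics or promxy ships. VM reportedly derives the window from the data, whereas this sketch just uses a constant, so results would not necessarily match. It also only handles simple cases (no offset or @ modifiers, no whitespace between the function name and its parenthesis).

```python
ROLLUP_FUNCS = ("delta", "idelta", "rate", "irate", "increase", "deriv")
QUOTES = "\"'`"


def _skip_string(q: str, i: int) -> int:
    """Return the index just past the string literal starting at q[i]."""
    quote = q[i]
    j = i + 1
    while j < len(q):
        if q[j] == "\\" and quote != "`":
            j += 2  # skip the escaped character
            continue
        if q[j] == quote:
            return j + 1
        j += 1
    raise ValueError("unterminated string literal")


def _matching_paren(q: str, i: int) -> int:
    """Return the index of the ')' closing the '(' at q[i]."""
    depth = 0
    j = i
    while j < len(q):
        ch = q[j]
        if ch in QUOTES:
            j = _skip_string(q, j)
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return j
        j += 1
    raise ValueError("unbalanced parentheses")


def add_default_window(query: str, window: str = "5m", step: str = "10s") -> str:
    """Wrap range-less arguments of rollup functions in a subquery."""
    out = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch in QUOTES:  # copy label values and other strings untouched
            end = _skip_string(query, i)
            out.append(query[i:end])
            i = end
            continue
        at_word_start = i == 0 or not (query[i - 1].isalnum() or query[i - 1] in "_:")
        name = None
        if at_word_start:
            name = next((f for f in ROLLUP_FUNCS if query.startswith(f + "(", i)), None)
        if name is None:
            out.append(ch)
            i += 1
            continue
        open_idx = i + len(name)
        close_idx = _matching_paren(query, open_idx)
        # Rewrite nested calls first, then check whether a range is present.
        arg = add_default_window(query[open_idx + 1:close_idx], window, step)
        if not arg.rstrip().endswith("]"):
            arg = f"({arg})[{window}:{step}]"
        out.append(f"{name}({arg})")
        i = close_idx + 1
    return "".join(out)


print(add_default_window("delta(sum(m) by (code))"))
# delta((sum(m) by (code))[5m:10s])
```

A real version would want to work on a parsed AST rather than raw text, but a string-level pass like this shows where such a translation would sit: before the query ever reaches the promql parser.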