jacksontj / promxy

An aggregating proxy to enable HA prometheus
MIT License

`offset` modifier causing OOM #678

Closed. arkbriar closed this issue 1 month ago

arkbriar commented 1 month ago

Hi!

I have a promxy server acting as an HA query frontend in front of two identical VictoriaMetrics instances. It has worked perfectly!

However, I found that certain queries make promxy OOM. I also noticed something strange: when I remove the offset modifier from one of those queries, the OOM goes away. Here's the original query:

max(max_over_time(sum(rate(stream_executor_row_count{namespace=~"default"}[30s] offset 34m46s))[12h2m27s:30s]))

The promxy process has 4 GiB available and normally uses ~200 MiB, so I suspect there's something special about how offset is handled. I also saw this PR that handles offset specifically.

I'd like to dig into the code later when I have time, but you folks know it far better and can surely provide more insight, so I opened this issue first. I will post more findings if I have any.

arkbriar commented 1 month ago

Update: I can reproduce it with a local checkout of the repo. The process used 3.19 GiB with the offset and 40 MiB without. Let me try digging into it.

arkbriar commented 1 month ago

Without the offset, promxy sends QueryRange requests:

DEBU[2024-09-10T19:08:01+08:00] http://localhost:8428                         api=QueryRange query="sum(rate(stream_executor_row_count{namespace=~\"default\"}[30s]))" r="{2024-09-08 22:26:30 +0000 UTC 2024-09-10 10:28:40 +0000 UTC 30s}"
DEBU[2024-09-10T19:08:02+08:00] http://localhost:8428                         api=QueryRange query="sum(rate(stream_executor_row_count{namespace=~\"default\"}[30s]))" r="{2024-09-08 22:26:30 +0000 UTC 2024-09-10 10:28:40 +0000 UTC 30s}" took=1.196407416s

And with offset 30m, it sends GetValue requests (pulling raw series data) instead:

DEBU[2024-09-10T19:08:29+08:00] http://localhost:8428                         api=GetValue end="2024-09-10 09:58:40 +0000 UTC" matchers="[namespace=~\"default\" __name__=\"stream_executor_row_count\"]" start="2024-09-08 21:55:43 +0000 UTC"
DEBU[2024-09-10T19:08:57+08:00] http://localhost:8428                         api=GetValue end="2024-09-10 09:58:40 +0000 UTC" matchers="[namespace=~\"default\" __name__=\"stream_executor_row_count\"]" start="2024-09-08 21:55:43 +0000 UTC" took=28.159786167s
DEBU[2024-09-10T19:08:57+08:00] Select                                        matchers="[namespace=~\"default\" __name__=\"stream_executor_row_count\"]" selectHints="&{1725832543000 1725962320000 345000 rate [] false 30000 false}" took=28.160525958s

I think that's the problem: it lies in how the query is rewritten at the AST level. But I don't understand the rationale behind the design. Could anyone please help?
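
For reference, here is a minimal sketch of how one can poke at the parsed query with prometheus's promql/parser package. This is illustration only, not promxy's actual rewrite code, and the OriginalOffset field name assumes a recent parser version. It just shows that the offset lives on the inner selector, nested under the 12h subquery:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	// The query from the report: the offset sits on the inner selector,
	// nested inside a 12h subquery.
	expr, err := parser.ParseExpr(`max(max_over_time(sum(rate(stream_executor_row_count{namespace=~"default"}[30s] offset 34m46s))[12h2m27s:30s]))`)
	if err != nil {
		panic(err)
	}

	// Walk the AST: each selector's own offset plus the offsets of any
	// enclosing subqueries determine how far back data has to be fetched.
	parser.Inspect(expr, func(node parser.Node, path []parser.Node) error {
		vs, ok := node.(*parser.VectorSelector)
		if !ok {
			return nil
		}
		effective := vs.OriginalOffset
		for _, p := range path {
			if sq, ok := p.(*parser.SubqueryExpr); ok {
				effective += sq.OriginalOffset
			}
		}
		fmt.Printf("selector %q: own offset %s, effective offset %s\n",
			vs.Name, vs.OriginalOffset, effective)
		return nil
	})
}
```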

arkbriar commented 1 month ago

Hi @jacksontj, I think I've fixed it in #681. Could you help take a look when you have time? Thanks a lot!

Also, thanks for creating such a tool! It's really amazing!

jacksontj commented 1 month ago

Thanks for the report; this is actually a VERY interesting issue. Your PR was SUPER helpful in understanding the report here -- amazing work, thanks!

I have now opened #683, which I believe will fix the issue. A longer description is in the PR, but TL;DR: when I added subquery support I failed to update the Offset selection for subqueries -- so this should cover it.

I did some testing and it seems fine (and I added a test for it). Mind giving that PR build a spin to see if it alleviates your issue? If not, we can grab some new data and continue from there.

arkbriar commented 1 month ago

Now I have opened https://github.com/jacksontj/promxy/pull/683 which I believe will fix the issue.

Thanks for the fix!

mind giving that PR build a spin to see if that alleviates your issue?

No problem! I've tested it. The logs look good!

DEBU[2024-09-16T15:30:33+08:00] http://localhost:8429                         api=QueryRange query="sum(rate(stream_executor_row_count{namespace=~\"default\"}[30s]))" r="{2024-09-15 17:53:44 +0000 UTC 2024-09-16 06:55:47.067 +0000 UTC 30s}"
DEBU[2024-09-16T15:30:33+08:00] http://localhost:8428                         api=QueryRange query="sum(rate(stream_executor_row_count{namespace=~\"default\"}[30s]))" r="{2024-09-15 17:53:44 +0000 UTC 2024-09-16 06:55:47.067 +0000 UTC 30s}"
DEBU[2024-09-16T15:30:33+08:00] http://localhost:8429                         api=QueryRange query="sum(rate(stream_executor_row_count{namespace=~\"default\"}[30s]))" r="{2024-09-15 17:53:44 +0000 UTC 2024-09-16 06:55:47.067 +0000 UTC 30s}" took=322.941083ms
DEBU[2024-09-16T15:30:33+08:00] http://localhost:8428                         api=QueryRange query="sum(rate(stream_executor_row_count{namespace=~\"default\"}[30s]))" r="{2024-09-15 17:53:44 +0000 UTC 2024-09-16 06:55:47.067 +0000 UTC 30s}" took=322.528792ms

I think https://github.com/jacksontj/promxy/pull/683 is good to go.

BTW, could you elaborate on why the sum(rate()) query is pushed down to the downstream servers? Because as far as I can tell, if we want to fill in missing data (holes) in the sources and get an accurate result, the only way is to pull the raw data up to promxy and do the calculation there. Could you help me understand the design decision?

jacksontj commented 1 month ago

could you elaborate more on the reason for the sum(rate()) query being pushed down to the downstream servers?

So this is a bit tricky -- and you do raise a good point here. To start off, let's talk about why we do query pushdown -- it pretty much all comes down to serialization and bandwidth. The initial PoC implementation of promxy (prior to open sourcing) was actually without NodeReplacer -- but the performance was SUPER SLOW. After a lot of digging, the issue basically boiled down to the downstream APIs having to serialize large amounts of data and ship it across the wire. So the "query pushdown" is intended to offload some compute, but primarily to reduce the amount of data we have to move around.

So with that in mind: assuming there are "no holes", pushing down a sum is safe, as a sum of sums is still the sum (effectively, sum is a re-entrant aggregator). That pushdown has been in place since promxy v0.0.1. For a subquery it is "the same" mechanism, so I called it "safe".
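
To make that concrete, here is a toy illustration with made-up numbers, assuming each downstream holds a disjoint set of series (the sharded model described below): summing the per-downstream sums gives the same result as summing every series in one place.

```go
package main

import "fmt"

func main() {
	// Toy numbers: two downstreams, each holding a disjoint set of series.
	downstreamA := []float64{3, 7}
	downstreamB := []float64{5, 2}

	sum := func(xs []float64) float64 {
		var s float64
		for _, x := range xs {
			s += x
		}
		return s
	}

	// What promxy gets back when sum() is pushed down, then merged.
	pushedDown := sum(downstreamA) + sum(downstreamB)

	// What a single evaluator would compute over all series at once.
	all := append(append([]float64{}, downstreamA...), downstreamB...)
	global := sum(all)

	fmt.Println(pushedDown, global) // 17 17 -- identical when no series are missing
}
```

Once a downstream is missing only part of its series at a timestamp, that equality breaks down, which is the hole problem you raise.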

I found that if we'd like to refill missing data (holes) in the sources and get an accurate result, the only way is to pull the data up to promxy and do the calculation there.

You do bring up an excellent point here -- and TBH this is a bit of a tricky one. The initial design for promxy was an aggregating proxy in front of a set of sharded prometheus hosts. In that model (at least historically), if there are holes it is generally "all series" that have holes (e.g. a prometheus restart/crash). But there could conceivably be scenarios where a single prometheus host has an issue getting data from a specific endpoint (e.g. a network policy that applies to only 1 replica of prometheus) -- and in that case we would see issues with the data calculation. Fundamentally, promxy is trying to strike a balance between correctness and performance -- and so far this mix has been sufficiently good.

That being said, now that I'm thinking through it a bit more (and describing it here), we could in theory make the pushdown use some group-by mechanism (e.g. by instance) so that we can attempt to "do better" on data fill (at the expense of more serialization/bandwidth). I'll have to think on that a bit; I'm not sure there is a good/safe way to do it without causing a potentially massive/catastrophic increase in bandwidth/memory (the extreme case being a sum over all instances in a large deployment -- the cardinality there may be high).
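
Very roughly, and purely as a hypothetical sketch (this is not something promxy does today), the idea would be to push down something like sum by (instance) (rate(...)) to each replica, merge the per-instance partials in promxy (which is where per-target hole filling could happen), and then finish the outer sum() locally:

```go
package main

import "fmt"

func main() {
	// Per-instance partial sums at one timestamp, as two replicas of a
	// server group might return them for a hypothetical pushed-down
	// `sum by (instance) (rate(...))`.
	replicaA := map[string]float64{"host-1:9100": 10} // host-2 missing: a per-target hole
	replicaB := map[string]float64{"host-1:9100": 10, "host-2:9100": 4}

	// Merge, preferring replica A and filling anything it is missing from B.
	merged := map[string]float64{}
	for inst, v := range replicaA {
		merged[inst] = v
	}
	for inst, v := range replicaB {
		if _, ok := merged[inst]; !ok {
			merged[inst] = v
		}
	}

	// promxy would then finish the outer sum() itself.
	var total float64
	for _, v := range merged {
		total += v
	}
	fmt.Println(total) // 14, instead of the 10 replica A alone would report
}
```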