thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
13.13k stars 2.1k forks

Query Pushdown #305

Open jacksontj opened 6 years ago

jacksontj commented 6 years ago

Hello thanos people! I just ran into your project the other day and wanted to chat with someone about some potential performance improvements for the Querier nodes. Having read a little through the docs, it sounds like the querier nodes basically just aggregate data from N prometheus servers into a single endpoint. I have a project https://github.com/jacksontj/promxy which does exactly that. I started it back in October of last year, and since then have picked up a lot of perf learnings (specifically, mechanisms to reduce the query load on the prom system).

I'll start by stating some context, to help center the rest of the feedback. The first version of Promxy that I wrote was a simple implementation of the Storage interface for prom. This was then attached to an engine and the API and, tada, a remote querier. The only real magic in that first version was how to merge datasets between hosts, etc.

From my reading of the code and docs it seems that Thanos' Query service does the same thing.

There are a variety of optimizations I've done since that first version, but the main 2 improvements are:

~1. Marshaling Performance: In the case of a remote querier, the amount of data being marshaled and shipped across the wire is a major concern. Since you are starting with the RemoteRead API I expect the marshaling latency isn't going to be a huge issue (it's fast and code-gen'd). The main concern here is the memory pressure this puts on prometheus itself. On the JSON side this was really bad -- but the issue is similar on the proto side of things -- specifically, when you ask for N bytes of data, that consumes at minimum 2N bytes of memory on the prometheus server (in the JSON case it was 14x, so proto is in a much better place). Solving this on the proto side would require implementing streaming to an io.Writer in the proto libraries. In my experience this was the most important feature not so much because of the performance, but because of the stability -- without this feature you can easily exhaust memory on a remote prometheus host with a seemingly "safe" query.~

EDIT(@bwplotka): This was done already.
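For illustration only, here is a rough Go sketch of the "stream one frame at a time to an io.Writer" idea described above. This is not the actual fix that landed upstream (that streams chunked remote-read responses); `writeFrame` and the length-prefixed framing are hypothetical, and any generated `proto.Message` would work as the payload:

```go
package streamread

import (
	"encoding/binary"
	"io"

	"google.golang.org/protobuf/proto"
)

// writeFrame marshals one protobuf message and writes it to w behind a
// uvarint length prefix. Emitting one frame per series (or per batch of
// samples) keeps peak memory close to the size of the largest single frame
// instead of the size of the whole response.
func writeFrame(w io.Writer, msg proto.Message) error {
	body, err := proto.Marshal(msg)
	if err != nil {
		return err
	}
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(body)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err = w.Write(body)
	return err
}
```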

2. NodeReplacer API: By implementing the Storage API, the querier needs to fetch all data in a window to calculate the response to the promql query. This is fine for single points or small windows, but when you get into the range of some dashboard queries (thousands of timeseries across days of duration) this becomes overwhelmingly large. In some cases I've seen this "fetch all the data" step take ~10x the time of running the query (not to mention the CPU and network util on the remote end). The solution I came up with was the NodeReplacer API. This API allows you to hook up a function to the promql parsing to "replace" sections of the query as you see fit. The API is quite generic, but the use case I have for it is actually relatively simple. The main goal is to take the large promql query and break it down into sub-queries that can be farmed out to other prom boxes, so they return an "aggregated" set of data over the wire. This will be easier to explain with an example:

Promql: sum(irate(http_request_duration_microseconds[1m]))

This (effectively) gets parsed into an AggregateExpr -> Call -> VectorSelector. If we were going solely through the Querier API we'd have to return all datapoints for timeseries that match http_request_duration_microseconds. With the NodeReplacer API we can instead ask the downstream prom node for irate(http_request_duration_microseconds[1m]) and sum those results together. This is done by replacing the AggregateExpr.Expr with a VectorSelector containing the fetched data results.
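As an illustration of which sub-tree would be shipped downstream, here is a minimal sketch using today's upstream promql/parser package (the NodeReplacer hook itself only exists in my fork; `innerPushdownExpr` is a hypothetical helper, not promxy code):

```go
package pushdown

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

// innerPushdownExpr returns the sub-expression that could be evaluated on the
// downstream Prometheus (the irate(...) call), leaving only the outer sum()
// to be computed by the querier. The real hook rewrites the AST in place;
// this only shows which sub-tree would be shipped downstream.
func innerPushdownExpr(query string) (string, error) {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return "", err
	}
	aggr, ok := expr.(*parser.AggregateExpr)
	if !ok {
		return "", fmt.Errorf("expected an aggregation at the top level, got %T", expr)
	}
	// For sum(irate(http_request_duration_microseconds[1m])) this returns
	// "irate(http_request_duration_microseconds[1m])"; the fetched, already
	// aggregated samples would then replace aggr.Expr as a VectorSelector-like
	// leaf before local evaluation of the outer sum().
	return aggr.Expr.String(), nil
}
```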

TLDR; this means we can dramatically reduce the amount of data that needs to be transferred for the vast majority of queries, without changing the outcome of the query itself. I had actually submitted this PR to upstream prom last year (https://github.com/prometheus/prometheus/pull/3631) but it was rejected, so as of now this is implemented in my fork on github (https://github.com/jacksontj/prometheus/tree/release-2.2_fork_promxy).

There are definitely a lot more things for us to chat about, but I think these are a good starting point to work from. Looking forward to making a better prom ecosystem!

bwplotka commented 6 years ago

Hey @jacksontj, thanks for all of these!

I think for the Thanos querier, we have some minor advantage. We know exactly what "types" of underlying leaves we have. It will be either a Prometheus server with a sidecar, or a store gateway on top of the long-term object storage.

Totally see this problem; however, I think we avoid it, because for Thanos you can have really lightweight Prometheus scrapers (scraper = sidecar + Prometheus): since all data is uploaded as soon as it appears on disk, there is no need for special optimizations for large numbers of samples on the Prometheus side. We can have relatively short metric retention (24h or even 12h).

BTW you did not finish the sentence:

In my experience this was the most important feature not so much because of the performance, but because of the stability you can

you can... ... achieve?

  1. NodeReplacer API

I totally see that use case. For a remote query, the query push down feature might decrease the network bandwidth used. I don't see any immediate reason why query push down would be a bad idea for Prometheus, except that it makes the promQL engine more complex and makes it easier to shoot yourself in the foot (:

Not sure how your deduplication code works though, but for Thanos I would be worried about deduplication correctness. It's a bit fuzzy on raw data -- with aggregated data I can imagine it being way worse or even impossible (we basically have less evidence of what happened with the scrapes). How did you solve that?

For Thanos we don't have the possibility of query push down, though: our Store API fetches raw series only, so there is no way to push the query down -- not without major changes to the protocol and adding a promQL engine to the leaves. Wonder if some caching logic would help a little bit here against dashboards. Have you considered some caching layer in your promxy?

jacksontj commented 6 years ago

We exactly know what "types" of underlying leafs we have. It will be either:

Presumably the sidecar has the same (or similar) performance issues as prometheus itself? From looking at the code it seems that the sidecar only implements the storage API, which means it requires shipping the maximum amount of data to the queriers for the querier to calculate an answer.

because for Thanos you can have really lightweight Prometheus scrapers (scraper = sidecar+ Prometheus),

Presumably the same issue happens wherever the long-term storage exists? Good to know that thanos mitigates the prom issues, but I'd definitely suggest looking into the marshaling performance (and memory consumption) of whatever your long-term storage system is.

BTW you did not finish the sentence:

I didn't, that's embarrassing :) I've edited the original comment, and pasted the sentence down here:

In my experience this was the most important feature not so much because of the performance, but because of the stability -- without this feature you can easily exhaust memory on a remote prometheus host with a seemingly "safe" query.

being bad idea for Prometheus, except the fact of making promQL engine more complex and easier to shoot yourself in a foot (:

The way I wrote the current one, it is an API to hook up functions that know how to mutate the tree. My expectation would be that the functions that do the replacing (such as the one in promxy) would expose their structs as packages for others to use. But at the end of the day, if you want more sophisticated control there is always a chance for more bugs -- the key with the implementation I have is that it's completely optional.

Not sure how your deduplication code works though, but for Thanos I would be worried about deduplication correctness.

It's fairly simple. The details are documented in the AntiAffinity config (https://github.com/jacksontj/promxy/blob/master/servergroup/config.go#L58), but the TLDR version is that promxy tries not to merge unless there is a sufficiently large hole. In the servergroup configuration you tell promxy which servers are scraping the "same" things, so when promxy determines that the "hole" is large enough it will fill it with data from some other host. This AntiAffinity is required because of clock drift (and to deal with how prom stores times from scrape events). This optional "merging" only happens at the bottom-most layer of queries.
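A rough sketch of that hole-filling idea (hypothetical types and names, not promxy's actual implementation; both series are assumed to be sorted by timestamp):

```go
package dedup

// Sample is a minimal timestamp/value pair; antiAffinityMs stands in for the
// "sufficiently large hole" threshold described above (the real knob lives in
// promxy's servergroup config).
type Sample struct {
	TimestampMs int64
	Value       float64
}

// fillHoles returns the primary series, patched with secondary samples only
// inside gaps larger than antiAffinityMs. Small gaps are left alone so that
// clock drift between scrapers does not interleave near-duplicate points.
func fillHoles(primary, secondary []Sample, antiAffinityMs int64) []Sample {
	out := make([]Sample, 0, len(primary))
	for i, s := range primary {
		out = append(out, s)
		if i+1 >= len(primary) {
			break
		}
		gapStart, gapEnd := s.TimestampMs, primary[i+1].TimestampMs
		if gapEnd-gapStart <= antiAffinityMs {
			continue // hole too small: do not merge
		}
		// Fill the gap with whatever the other host saw in that window.
		for _, o := range secondary {
			if o.TimestampMs > gapStart && o.TimestampMs < gapEnd {
				out = append(out, o)
			}
		}
	}
	return out
}
```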

adding promQL engine to leafs

Correct, this would require promql at the leaves. Alternatively you could just send the request to prometheus itself, but I'm not 100% clear on the distinction between the prometheus host and the sidecar/leaf (sounds like it's mostly to do with metrics retention).

Have you considered some caching layer in your promxy?

Thought about it, but with the query push down it becomes unnecessary. For caching you'd have to cache sections of the results, and depending on the promql evaluation done they may or may not be mergeable with other sources (for example, an avg calculation over one interval can't be reused for another).

bwplotka commented 6 years ago

Presumably the same issue happens wherever the long-term storage exists? Good to know that thanos mitigates the prom issues, but I'd definitely suggest looking into the marshaling performance (and memory consumption) of whatever your long-term storage system is

Exactly that! I am only saying that this level of optimization would be required for the component that operates against long-term storage, which for Thanos is the Store Gateway, so we are not affected by Prometheus Remote Read limitations.

TLDR version is promxy tries to not merge unless there is a sufficiently large hole.

Ok, but it is done at your aggregated query level, right? So if you push all queries down, this works on already aggregated (e.g. irate'd, topk'd, histogram'd) data. I still think that it can have different results when deduplicating aggregated samples vs raw samples, but I need to look more into that tomorrow (: Not sure now.

Basically: I don't think there is a rule that says for any function ABC we guarantee that the resulting ABC(series)'s samples have exactly the same timestamps and exactly the same number of samples as the raw series.

Alternatively you could just send the request to prometheus itself

Yea, I cannot, because we use a gRPC service instead. But also, again, the main place where "query push down" or marshalling improvements would be beneficial is our store gateway, which is totally a remote service to the Prometheus server.

For caching you'd have to cache sections of the results

Yea, but that's only the case when you actually do query push down, which we don't do. You do -- so I understand you 100% here.

To sum up:

EDIT: query push down might be tricky on time boundaries though, for rate/irate/increase etc. -- anything that looks back to evaluate the current value. The thing is that we have downsampling, which decreases the number of bytes sent around.

bwplotka commented 6 years ago

Ah, I missed that you actually push down only some sub-queries... hm

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

daixiang0 commented 4 years ago

@bwplotka still need this?

bwplotka commented 3 years ago

Starting a formal design doc, feel free to collaborate (: https://docs.google.com/document/d/1ajMPwVJYnedvQ1uJNW2GDBY6Q91rKAw3kgA5fykIRrg/edit#

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

kakkoyun commented 3 years ago

Still valid.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

kakkoyun commented 3 years ago

Still valid.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

GiedriusS commented 3 years ago

As mentioned on Slack, I suggest unblocking the work on this by starting with a simple case: we can push down any kind of query to a StoreAPI node if its time range doesn't overlap with any others. This should cover a lot already, since the most recent data is usually covered by Receiver/Rule/Sidecar, and the historical part is covered by a load-balanced Thanos Store. The most recent data typically only spans a day or two, so IMHO it is fine to retrieve everything from there. Thus, with query-frontend splitting a query, we can already save a lot of data transfer over the wire.
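The check itself is trivial; a minimal sketch, assuming each StoreAPI node advertises a [minT, maxT] pair (hypothetical `storeRange`/`canPushDown` names, not actual Thanos code):

```go
package query

// storeRange is a hypothetical [minT, maxT] window (milliseconds) advertised
// by a StoreAPI node.
type storeRange struct {
	MinT, MaxT int64
}

// canPushDown reports whether the store at index i is the only one covering
// its time range, i.e. its range overlaps no other store's range. In that
// case the whole (sub-)query for that range could be evaluated by the store
// alone and only the result shipped back.
func canPushDown(stores []storeRange, i int) bool {
	for j, other := range stores {
		if j == i {
			continue
		}
		// Two ranges overlap when each starts before the other ends.
		if stores[i].MinT <= other.MaxT && other.MinT <= stores[i].MaxT {
			return false
		}
	}
	return true
}
```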

We already do many tricks for downsampled data in iterators, and naturally Select() is used for retrieving data (not the results), so it seems to me that a different solution is needed. TBH I like the NodeReplacer idea. I saw a PR where you were trying to contribute this to Prometheus, but maybe a simple API to do this would be accepted now that there are more use cases for it?

Or maybe we could make it even simpler: pass everything down via gRPC to another PromQL engine and then just return its results as our (the querier's) results, i.e. let's only have one engine. So, in a sense, Thanos Query would become a simple reverse proxy that proxies the query it gets to a StoreAPI node if that node is the only one covering the time range?

bwplotka commented 2 years ago

BTW, we are finally moving on this work item 🤗

I made a quick PoC here: https://github.com/thanos-io/thanos/pull/4901. The first iteration was done by @fpetkovski: https://github.com/thanos-io/thanos/pull/4917

We can work on the first iteration; in the meantime we are working on the formal proposal.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

markmsmith commented 2 years ago

I think this is still needed.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

markmsmith commented 2 years ago

Still needed.

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

markmsmith commented 2 years ago

Still needed