thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Reduce unnecessary StoreAPI series calls #1611

Open brancz opened 4 years ago

brancz commented 4 years ago

Thanos currently performs a "possibly in set" check in order to exclude certain StoreAPIs from calls. This check only excludes a StoreAPI if all of its sets of external labels indicate that the time series selected by a given query can never include data from that StoreAPI. While the check never wrongly excludes a StoreAPI, the potential for empty responses is still large, as it only helps when the query actually selects on the external labels that a StoreAPI happens to expose.

In order to expand these "possibly in set" checks, I propose that StoreAPIs additionally expose one or more probabilistic data structures via the Info API, next to the sets of external labels. Data structures such as Bloom filters or Cuckoo filters could be good candidates. This way we could minimize unnecessary calls to StoreAPIs in a space-efficient way.

This would not only improve performance but also reliability, as any additional, unnecessary network request has the potential to cause what looks like a partial response even when it isn't one.

To verify further that this would be useful, I propose instrumenting the querier with metrics to confirm that these kinds of empty StoreAPI series requests are actually happening and therefore impacting performance; otherwise this optimization is likely not worth it.
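
For illustration, a minimal Go sketch of the kind of querier instrumentation proposed here; the metric name, label, and helper function are hypothetical, not existing Thanos code:

```go
package store

import "github.com/prometheus/client_golang/prometheus"

// emptyStoreResponses counts Series calls that fanned out to a StoreAPI
// but returned no series at all. A high ratio of empty to total responses
// would justify the filtering proposed in this issue.
// Metric and label names are hypothetical placeholders.
var emptyStoreResponses = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "thanos_query_store_empty_series_responses_total",
		Help: "Total number of Series calls to a StoreAPI that returned zero series.",
	},
	[]string{"store"},
)

func init() {
	prometheus.MustRegister(emptyStoreResponses)
}

// recordSeriesResult would be called by the proxy store after a Series
// call to a single StoreAPI completes.
func recordSeriesResult(storeAddr string, seriesReturned int) {
	if seriesReturned == 0 {
		emptyStoreResponses.WithLabelValues(storeAddr).Inc()
	}
}
```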

@bwplotka @domgreen @GiedriusS @metalmatze @squat @kakkoyun

squat commented 4 years ago

Bloom filters would be a great tool for this. +1 on investigating the false positive rate for querier API requests.

GiedriusS commented 4 years ago

In theory, it sounds good, but I wonder about the performance implications of keeping that Bloom filter up-to-date, especially in the case of Sidecar/Prometheus. This probably makes the most sense in environments where there are a bunch of disparate Thanos Query instances and one "on top" which is being queried. Either way, it would be nice to have such a metric so that we could evaluate this idea better.

bwplotka commented 4 years ago

In my eyes, only a small and very frequently used portion of labels would be maintained by such a filter, so it should be fine.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

brancz commented 4 years ago

I think we have yet to instrument the proxy store type (used by the query component) to validate whether and how much this is happening. I do suspect that it will vary widely across environments.

brancz commented 4 years ago

With #2030 merged we should be able to at least identify the potential for this.

stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

jojohappy commented 4 years ago

Maybe it's still valid.

brancz commented 4 years ago

Looking at our prod environment, the metrics show that we do a lot of requests that return empty responses.

stale[bot] commented 4 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

brancz commented 4 years ago

As mentioned in the previous comment, there is definitely room for improvement still.

stale[bot] commented 4 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 4 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

GiedriusS commented 4 years ago

Still valid.

stale[bot] commented 3 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

yeya24 commented 3 years ago

Do we have any plans/design docs for this feature? I have had some time recently, so maybe I can do some experiments first.

One simple idea is to get all metric names (/api/v1/label/__name__/values) for each store in the Info() StoreAPI call and keep them in a Bloom filter in StoreRef. Keeping only metric names cannot work for queries that don't specify __name__, but it covers most cases and I think it is a good start.

The problem is that the Info API is called every 5s for each store, so I am not sure about the performance overhead (it might be big, because we would need to query all metric names in the store's time range).

Also, for updating/deleting elements in the Bloom filter, how do we make each update efficient?
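
As an illustration of the idea above, a minimal Go sketch that keeps metric names in a Bloom filter and checks an equality __name__ matcher against it before fanning out. It assumes github.com/bits-and-blooms/bloom/v3 as one possible library; the function names and the integration with StoreRef/Info are hypothetical:

```go
package main

import (
	"fmt"

	"github.com/bits-and-blooms/bloom/v3"
	"github.com/prometheus/prometheus/model/labels"
)

// buildMetricNameFilter builds a Bloom filter over the metric names a store
// exposes. In the idea discussed above, the store side would build and
// serialize this and expose it via the Info API; the querier would keep one
// per StoreRef.
func buildMetricNameFilter(metricNames []string) *bloom.BloomFilter {
	// ~1% false positive rate; the filter stays small even for many names.
	f := bloom.NewWithEstimates(uint(len(metricNames)), 0.01)
	for _, name := range metricNames {
		f.Add([]byte(name))
	}
	return f
}

// storeMayMatch reports whether a store possibly has data for the query.
// Only an exact __name__ matcher can be checked against a metric-name
// filter; anything else must be treated as "possibly in set".
func storeMayMatch(f *bloom.BloomFilter, matchers []*labels.Matcher) bool {
	for _, m := range matchers {
		if m.Name == labels.MetricName && m.Type == labels.MatchEqual {
			return f.Test([]byte(m.Value))
		}
	}
	return true
}

func main() {
	f := buildMetricNameFilter([]string{"http_requests_total", "up"})
	sel := []*labels.Matcher{labels.MustNewMatcher(labels.MatchEqual, "__name__", "node_cpu_seconds_total")}
	fmt.Println(storeMayMatch(f, sel)) // false (with high probability), so the store could be skipped
}
```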

kakkoyun commented 3 years ago

Do we have any plans/design docs about this feature?

I don't think we have. It would be a great place to start for whoever decides to pick up this issue. We can elaborate more on the challenges you have mentioned.

Also, there's a bit of experimentation needed to pick the correct approach/data structure for the implementation. I have come across some nice articles and projects recently that are worth looking at before starting (I haven't dug deep though):

On Bloom filter efficiency: https://gopiandcode.uk/logs/log-bloomfilters-debunked.html
On Quotient filters: https://github.com/facebookincubator/go-qfext

stale[bot] commented 3 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

brancz commented 3 years ago

I think we have yet to measure how often we're unnecessarily hitting stores, to validate further work.

stale[bot] commented 3 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

kakkoyun commented 3 years ago

Still valid, help wanted.

stale[bot] commented 3 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

GiedriusS commented 3 years ago

I have worked on this for a bit https://github.com/thanos-io/thanos/compare/main...GiedriusS:feature/bloom_metric_names and I have some insights:

Given this, I am not sure how useful this feature would be with Bloom filters. Quotient filters do not seem to be any different with regard to these two points. I couldn't find any data structure that would be more useful, given how labels/PromQL work.

yeya24 commented 3 years ago

This implementation looks nice, great work! I feel we could have label names in the Bloom filter as well and it should work: if a label is used in an equality or regex matcher, then the label name needs to be there. However, the sets of label names and metric names are almost the same across blocks in S3, so it doesn't help a lot in this case. The improvement would be adding label values to the Bloom filter for equality matchers? That works well if you want to query some high-cardinality label like pod ID, but it also increases the Bloom filter's false positives...

stale[bot] commented 3 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 3 years ago

Closing for now as promised, let us know if you need this to be reopened! πŸ€—

GiedriusS commented 2 years ago

Maybe let's add an initial version based on label names? Matchers can only be exact for label names, so Bloom filters should work wonderfully there :thinking:

stale[bot] commented 2 years ago

Hello πŸ‘‹ Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€— If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] commented 2 years ago

Closing for now as promised, let us know if you need this to be reopened! πŸ€—

fpetkovski commented 1 year ago

Since some preliminary research has already been done, should we add a gsoc/lfx label on this issue?

GiedriusS commented 1 year ago

We talked a bit about this during KubeCon, and I think the consensus is that we need some central place to store the bloom/cuckoo/... filters. If we stored them locally, then in a distributed system we could get inconsistent results: for example, one Thanos Query could retrieve data from nodes A and B, whereas another Thanos Query could have newer information and retrieve data from nodes A, B, and C. Hence, something like etcd/Consul/... is needed to store these filters for each StoreAPI.

I think it's still doable during GSoC/LFX.

fpetkovski commented 1 year ago

Hm, isn't that the same case we have with filtering out stores based on external/announced labels?

GiedriusS commented 1 year ago

Hmmm, that's true :thinking: Then perhaps we wouldn't be introducing any new inconsistency issues. Let's mark it as a GSoC/LFX project.

yeya24 commented 1 year ago

One point to discuss: whether or not we want to use something like the Info API at central queriers to collect the Bloom filter information centrally.

If queriers collect that info periodically, we get:

  1. Probable inconsistency, because the fetch may lag behind updates to the store's Bloom filter.
  2. Better performance, since the querier can make the request-routing decision locally at query time, based on the collected information.

If instead we add another API for each store to test whether some metric exists, that always involves an RTT.

fpetkovski commented 1 year ago

For the inconsistency problem, I think we can make it manageable if we only keep metric names in the Bloom filters. Metric names are usually slow to change: a metric will usually stay in a store for a long time, and once it goes away, it's still okay for the querier to reach out to that store.

yeya24 commented 1 year ago

I don't think we can assume that metric names are slow to change. Whenever a new block gets loaded, we cannot tell whether new metrics have been introduced or not. Unless some push model is used, we cannot always have up-to-date data with the current pull model.

fpetkovski commented 1 year ago

Correct, but loading a new block already has an expected delay, which means we can tolerate a 5-10s delay when announcing new metrics from that block. Maybe this Bloom filter could be rebuilt when a block is loaded?

The bigger challenge is in-memory ingestion like Prometheus and Receivers, but if we keep the delay to a few seconds, I don't think it will be a big issue. There might be some inconsistency when one Prometheus replica has announced a metric and another hasn't, but I guess this is why we duplicate data so it won't be that different from a partial read.

yeya24 commented 1 year ago

The bigger challenge is in-memory ingestion like Prometheus and Receivers, but if we keep the delay to a few seconds, I don't think it will be a big issue. There might be some inconsistency when one Prometheus replica has announced a metric and another hasn't, but I guess this is why we duplicate data so it won't be that different from a partial read.

Right. I think we don't have a good way to deal with this case right now. For either Prometheus sidecar or receiver mode, how would the Bloom filter be maintained for in-memory data?

Receiver mode seems fine, since we can introduce a Bloom filter and update it on write. But sidecar mode is hard; we can only pull periodically.

yeya24 commented 1 year ago

I am still thinking about using the trigram index mentioned in the Monarch paper: https://www.vldb.org/pvldb/vol13/p3181-adams.pdf

This is similar to the Bloom filter approach but works better for regex matchers. The same problem remains, though: I don't see a good way to maintain the trigrams in memory, especially for Prometheus and Receivers.
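
For reference, a minimal sketch of the Monarch-style trigram idea: a store publishes the trigrams of the values it has seen, and a querier can skip the store if a literal substring required by the query contains a trigram the store never published. This is an illustration only, not an existing Thanos implementation:

```go
package main

import "fmt"

// trigrams returns the set of 3-character substrings of s. In a
// Monarch-style index, a store would publish the trigrams of the values
// it has seen.
func trigrams(s string) map[string]struct{} {
	out := make(map[string]struct{})
	for i := 0; i+3 <= len(s); i++ {
		out[s[i:i+3]] = struct{}{}
	}
	return out
}

// mayContain reports whether a value whose trigrams are in index could
// contain the literal substr. False means the store definitely has no
// matching value; true means "possibly in set".
func mayContain(index map[string]struct{}, substr string) bool {
	for tri := range trigrams(substr) {
		if _, ok := index[tri]; !ok {
			return false
		}
	}
	return true
}

func main() {
	index := trigrams("http_requests_total")
	fmt.Println(mayContain(index, "requests")) // true: all trigrams of "requests" were published
	fmt.Println(mayContain(index, "node_cpu")) // false: the store can be skipped
}
```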

fpetkovski commented 1 year ago

I think for receivers we might be able to build the index at ingestion time and maybe rebuild it on boot if there is a crash.

I agree that for the sidecar it will be quite challenging. Maybe we could make labels/__name__/values calls against Prometheus on a periodic basis.
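
A minimal sketch of what such periodic polling could look like, using Prometheus' standard /api/v1/label/__name__/values HTTP endpoint; the polling interval, URL, and how the result would feed the sidecar's filter are placeholders, not existing Thanos code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// fetchMetricNames pulls all metric names from Prometheus via the
// standard /api/v1/label/__name__/values endpoint.
func fetchMetricNames(promURL string) ([]string, error) {
	resp, err := http.Get(promURL + "/api/v1/label/__name__/values")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var body struct {
		Status string   `json:"status"`
		Data   []string `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return body.Data, nil
}

func main() {
	// Poll periodically; each result could be used to rebuild the
	// sidecar's metric-name filter (interval and URL are placeholders).
	for range time.Tick(30 * time.Second) {
		names, err := fetchMetricNames("http://localhost:9090")
		if err != nil {
			fmt.Println("poll failed:", err)
			continue
		}
		fmt.Println("metric names seen:", len(names))
	}
}
```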