thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Query: `/api/v1/stores` confusingly returns endpoints not exposing store API #5238

Open matej-g opened 2 years ago

matej-g commented 2 years ago

Thanos, Prometheus and Golang version used: Thanos v0.25.0

What happened: We recently switched to using the stateless ruler. On the Stores page of the UI, I still see the rulers, even though they do not expose the Store API and no store information is available for them. See: [screenshot from 2022-03-16]

What you expected to happen: For the `/api/v1/stores` API and the Stores UI page to display only components that actually expose the Store API.

How to reproduce it (as minimally and precisely as possible): Run a querier instance that points to a stateless ruler.

Anything else we need to know: I think the logic responsible for collecting store information (https://github.com/thanos-io/thanos/blob/main/pkg/api/query/v1.go#L719) should filter these components out.

cc @saswatamcode

yeya24 commented 2 years ago

I am trying to understand this problem better. Did you configure the stateless ruler endpoints on Thanos Querier? If you didn't configure it then they shouldn't appear on the UI.

saswatamcode commented 2 years ago

@yeya24, we configure stateless Ruler as a store in Querier, the reason being that we want to utilize the Querier /api/v1/rules endpoint and see the rules in Querier UI too. If we don't configure it as a store we cannot get any info from the api/v1/rules endpoint.

But it also shows up as a store, even though it doesn't expose any StoreAPI, which is misleading. So they should be filtered out instead. I think current filtering logic is just based on component name.

Alternatively, maybe a flag like --ruler instead of --store on querier would work better in Querier to support this particular case? 🙂

yeya24 commented 2 years ago

> @yeya24, we configure stateless Ruler as a store in Querier, the reason being that we want to utilize the Querier /api/v1/rules endpoint and see the rules in Querier UI too. If we don't configure it as a store we cannot get any info from the api/v1/rules endpoint.
>
> But it also shows up as a store, even though it doesn't expose any StoreAPI, which is misleading. So they should be filtered out instead. I think current filtering logic is just based on component name.
>
> Alternatively, maybe a flag like --ruler instead of --store on querier would work better in Querier to support this particular case? 🙂

Makes sense to me. Yeah, the filtering logic should be improved. Btw, the Info RPC response contains a store API field that tells whether the Store API is exposed by the component. For a stateless ruler this shouldn't be set.
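A minimal sketch of the filtering idea discussed here, assuming a simplified endpoint model. The `Endpoint` and `StoreInfo` types and the `StoresByComponent` function are hypothetical stand-ins for illustration, not the actual Thanos types or the code in `pkg/api/query/v1.go`. The only idea taken from the thread is that an endpoint whose Info response has no store section should be left out of the stores listing.

```go
package main

import "fmt"

// StoreInfo is a hypothetical stand-in for the store section of an Info response.
type StoreInfo struct {
	MinTime, MaxTime int64
}

// Endpoint is a hypothetical, simplified view of an endpoint known to the querier.
type Endpoint struct {
	Name          string
	ComponentType string
	Store         *StoreInfo // nil when the endpoint does not expose the Store API
}

// StoresByComponent groups endpoints by component type and skips every
// endpoint that does not advertise the Store API.
func StoresByComponent(endpoints []Endpoint) map[string][]Endpoint {
	out := make(map[string][]Endpoint)
	for _, ep := range endpoints {
		if ep.Store == nil {
			continue // e.g. a stateless ruler that only serves the Rules API
		}
		out[ep.ComponentType] = append(out[ep.ComponentType], ep)
	}
	return out
}

func main() {
	endpoints := []Endpoint{
		{Name: "sidecar-0", ComponentType: "sidecar", Store: &StoreInfo{MinTime: 0, MaxTime: 100}},
		{Name: "ruler-0", ComponentType: "rule", Store: nil},
	}
	stores := StoresByComponent(endpoints)
	// The sidecar is listed; the stateless ruler is not.
	fmt.Println(len(stores["sidecar"]), len(stores["rule"]))
}
```

With this approach the decision depends on what the endpoint advertises rather than on its component name, so any non-store component passed to the querier would be excluded in the same way.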

matej-g commented 2 years ago

As far as I understand, the issue is that the --store flag is now ≃ the --endpoint flag. The stores method in the API indiscriminately returns all endpoints, even if they do not expose the Store API. I think the stateless ruler already does not expose the Store API, but currently that does not cause the stores endpoint to exclude it. I believe the same would happen with any other non-store component passed to the --store / --endpoint parameter.

bwplotka commented 2 years ago

There is a --rule flag: https://github.com/thanos-io/thanos/blob/79e70da702228ac0282fc7639f9f160d922b6dcb/cmd/thanos/query.go#L114-L113

We now have an incident, which we are handling with @moadz, where adding a stateless ruler as a store potentially causes problems.

bwplotka commented 2 years ago

And yes, the unfinished endpoints work is extremely confusing.

jlarsonta commented 3 days ago

Hi @bwplotka, I've been brushing up on this topic across multiple issues/proposals, and the behavior described above, where:

> --store flag ≃ --endpoint flag

blocks us from upgrading from Thanos v0.23. Our issue is that we run well over a thousand Prometheus servers, each with a lot of rules. When Grafana or a user (via the Rules page in Thanos) hits the rules API in Thanos, the queriers in the middle OOM due to the size of the combined ruleset. Simply increasing memory is not an option here; there are too many rules.

I'm curious whether there has been any traction on this recently. The upgrade we attempted last year was from v0.23 to v0.31. Having revisited these open issues and the config options in the latest version, it doesn't seem like there is a lever that disables rules API discovery so that we can upgrade. We'd like to finally upgrade and make use of all of the new features and enhancements, so I'd really appreciate some advice here. Thanks!