thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0
Prometheus Federate API endpoint #5541

Open aslafy-z opened 1 year ago

aslafy-z commented 1 year ago

Is your proposal related to a problem?

I'd like my federated Prometheus instances to use Thanos as a data source instead of the underlying Prometheus. That would offer deduplicated metrics without needing to add a querier sidecar to all of them.

Describe the solution you'd like

Expose a feature-complete /federate endpoint that can be used with the Prometheus federation feature.
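
For context, a federation scrape against a Prometheus server is a GET on /federate with one or more match[] selectors, and the request here is for Thanos to answer the same kind of request. A minimal sketch of building such a URL follows; the host name and the selectors are made-up placeholders, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per series selector."""
    # Passing a list of pairs lets the same key (match[]) repeat.
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Placeholder host and selectors, for illustration only.
url = federate_url("http://prometheus.example:9090", ['{job="node"}', "up"])
print(url)
```

In a real setup the federating Prometheus builds this request itself from the match[] entries under params in its scrape config; the sketch only shows what ends up on the wire.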

Describe alternatives you've considered

Build or use an existing proxy that translates the Query API. I found https://github.com/snapp-incubator/thanos-federate-proxy and https://github.com/G-Research/thanos-remote-read, neither of which looks well maintained.

Additional context

Previous issues:

Previous implementations:

Slack discussions:

yeya24 commented 1 year ago

That would offer deduplicated metrics without the need of adding a querier sidecar to all of them.

If we follow the existing proxy API approach, as with exemplars and rules, the logic is a bit different: there is no deduplication across two Prometheus instances, and we still preserve the external labels. Deduplication is doable, but the logic is more complicated.

stale[bot] commented 1 year ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

aslafy-z commented 1 year ago

This is still something I'd love to see in Thanos.

yeya24 commented 1 year ago

I have seen some use cases that require the federation API, specifically a federation API for Ruler to expose the recorded metrics in its TSDB. Although this issue mainly targets Thanos Query, we can revisit it and decide whether we want to support it.

rouke-broersma commented 1 year ago

We currently use Prometheus federation to pull metrics from remote Prometheus instances for centralized alerting on some managed environments. However, this causes a lot of duplicate-metric warnings, because we run multiple Prometheus instances on the remote side that collect the same metrics. We use Thanos on the remote side to dedup the data, so we would love to pull the federated data from Thanos instead of from Prometheus.

iamyeka commented 1 year ago

Same use case

HarshitaJha commented 1 year ago

Hi. Is there any update on this issue? We have a similar use case.

luisdavim commented 10 months ago

Has anyone tried https://github.com/snapp-incubator/thanos-federate-proxy recently?

baryluk commented 1 month ago

I would also like this. (I think I commented on some other bug about the efficiency of /api/v1/query some time ago, but I could not find it today.)

We have some k8s clusters with Thanos (mostly OpenShift), but we prefer to build some data sets, alert rules and dashboards outside of OpenShift, because that is much easier with our own tools than with what OpenShift, or k8s manifests in general, can achieve. In fact, we would not use Thanos at all if OpenShift did not use it by default with no way to disable it, and with no easy way to expose the normal Prometheus / Thanos query interfaces outside of the cluster (OpenShift disables them, and if you try to enable them, they reconverge back to blocked).

Currently we run a simple Python program between Prometheus and Thanos, called thanos_converter, that simply requests /api/v1/query?query={namespace="foo"} and converts the JSON on the fly to the text metrics exposition format. It works okay-ish but is really slow: Thanos loads the entire result into memory first and serializes it to JSON before it starts outputting, instead of streaming data on the fly. That can easily take 30 seconds before the first byte of the response. My converter then uses the Python ijson library (an incremental JSON parser) to convert the response on the fly to the exposition format (300 MB uncompressed, as it is on localhost anyway) in about 10 seconds.

But this uses a lot of CPU and memory on the Thanos side and a lot of CPU in the converter (~10 seconds of full usage of one core), and the total (~40 seconds) is close to our scrape interval of 60 seconds.
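
To make the conversion step concrete, here is a minimal sketch of turning an instant-query JSON response into text exposition lines. It is not the attached thanos_converter: the function names are hypothetical, and it parses the whole body with the standard json module, so it shows only the format mapping and not the incremental ijson streaming described above.

```python
import json


def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def vector_to_exposition(response_text):
    """Convert an /api/v1/query instant-vector response to exposition text."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("only instant vectors are handled in this sketch")

    lines = []
    for sample in data["result"]:
        labels = dict(sample["metric"])
        name = labels.pop("__name__", None)
        if name is None:
            continue  # a sample without a metric name cannot be exposed
        timestamp, value = sample["value"]  # [unix seconds, value as string]
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        # The exposition format expects timestamps in milliseconds.
        millis = int(round(float(timestamp) * 1000))
        lines.append(f"{series} {value} {millis}")
    return "".join(line + "\n" for line in lines)


# Hand-written sample response, not real Thanos output.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node"},
             "value": [1700000000, "1"]},
        ],
    },
})
print(vector_to_exposition(sample), end="")
```

Because the whole response is held in memory here, a 300 MB result would be costly; that is the problem the incremental parser in the real converter avoids.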

We have been running this thanos_converter for about 3 years, and I never liked it much (because of Thanos' poor performance), but it looks like https://github.com/snapp-incubator/thanos-federate-proxy does a very similar thing (except that our code uses far less memory, thanks to response streaming).

I am attaching the code. How to use it should be trivial (just run it with --help and set a few parameters), as should how to scrape it. MIT license, Witold Baryluk, 2021-2024.

thanos_converter.tar.gz

JQLSpec commented 4 days ago

Has there been additional work on this? I have a use case where it would be very handy to scrape metrics from a Prometheus-like federate endpoint, so that we can feed specific metrics into a SaaS observability system rather than the full Prometheus firehose. We are using Thanos to aggregate metrics from Prometheus in several Kubernetes clusters.