thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

[Receive] New Tenant is not queryable by receiver #7892

Closed jnyi closed 1 week ago

jnyi commented 2 weeks ago

We are testing the latest Thanos main branch and found a regression that did not exist in v0.36 and earlier.

On a running Thanos receiver cluster, we started a new tenant called "eng-host-networking". The TSDB head metric for that tenant started to show up, but none of the tenant's metrics are queryable until the receiver cluster is restarted.

[Two screenshots attached: 2024-11-07, 10:14:01 AM and 10:14:09 AM]

How to repro:

  1. Start a receiver cluster using the latest main.
  2. Send remote write requests with a new tenant.
  3. Verify the data is received using prometheus_tsdb_head_series{tenant="<new tenant>"}.
  4. Go to the endpoints page: the new tenant isn't listed, and metrics from that tenant cannot be queried.
  5. Restart the receiver cluster: the data shows up (no data loss).
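
Step 3 can be scripted roughly as below. This is a minimal sketch, not part of the original report: it assumes the receivers' own metrics are scraped by something that exposes the Prometheus-compatible instant query API (`/api/v1/query`), and `QUERY_BASE_URL` and all function names are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of a Prometheus-compatible query endpoint that scrapes
# the receivers' own metrics. Adjust to your environment.
QUERY_BASE_URL = "http://localhost:9090"


def head_series_query(tenant: str) -> str:
    """Return the PromQL selector from step 3 for a single tenant."""
    return f'prometheus_tsdb_head_series{{tenant="{tenant}"}}'


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def has_samples(response_body: str) -> bool:
    """True if the JSON response is successful and contains at least one series."""
    payload = json.loads(response_body)
    return payload.get("status") == "success" and len(payload["data"]["result"]) > 0


def tenant_is_receiving(base_url: str, tenant: str) -> bool:
    """Run the step 3 check against a live endpoint (requires network access)."""
    url = instant_query_url(base_url, head_series_query(tenant))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return has_samples(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(instant_query_url(QUERY_BASE_URL, head_series_query("eng-host-networking")))
```

If `tenant_is_receiving` returns true while the same tenant's own series return nothing through the querier, that matches the symptom in steps 3 and 4.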

Thanos, Prometheus and Golang version used: Thanos v0.37.0-dev, Golang v1.23


dosubot[bot] commented 2 weeks ago

I found a similar open issue that might be related to your problem: "Thanos query don't query all metrics on a receiver". In that issue, the solution was to configure the querier to query every pod in the receiver ring through a headless service, specifying each pod's endpoint in the configuration rather than the service endpoint. They also adjusted the hashring.json configuration to include all receiver endpoints. This change ensured that all metrics were consistently queried, although they noted that scaling the receiver ring then required changes in multiple places.

For your specific issue, it seems like the new tenant is not being dynamically registered for querying without a restart. The Thanos receiver component should dynamically detect and register new tenants without requiring a restart by provisioning an independent TSDB for each new tenant detected via the tenant HTTP header [1]. However, if this is not happening, it might be a regression or a configuration issue in the latest main branch.
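
The expected behavior described above (a per-tenant store provisioned on first write and visible to queries immediately) can be pictured with a toy model. This is only an illustrative sketch under stated assumptions: `ToyMultiTSDB` and its methods are hypothetical names, it is not Thanos code, and it says nothing about the actual cause of the reported regression.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (series name, value)


class ToyMultiTSDB:
    """Toy stand-in for a multi-tenant store: one in-memory list per tenant."""

    def __init__(self) -> None:
        self._tenants: Dict[str, List[Sample]] = {}

    def write(self, tenant: str, sample: Sample) -> None:
        # The first write for an unseen tenant provisions its store on the fly.
        self._tenants.setdefault(tenant, []).append(sample)

    def queryable_tenants(self) -> List[str]:
        # Derived from live state on every call, so a tenant created after
        # startup is listed without any restart.
        return sorted(self._tenants)

    def query(self, tenant: str, series: str) -> List[float]:
        # Unknown tenants simply yield no data.
        return [value for name, value in self._tenants.get(tenant, []) if name == series]


if __name__ == "__main__":
    db = ToyMultiTSDB()
    db.write("default-tenant", ("up", 1.0))
    db.write("eng-host-networking", ("up", 1.0))  # new tenant, no restart
    print(db.queryable_tenants())
```

In this model, the report corresponds to `write` succeeding for the new tenant while the tenant stays absent from the queryable view until the process restarts.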

You might want to check if the configuration flags related to tenant registration and queryability are correctly set, such as --receive.tenant-header and any relevant querier flags [2][1]. If the issue persists, it could be beneficial to report this as a potential regression in the main branch, providing the details and steps to reproduce as you've outlined.


jnyi commented 1 week ago

ts=2024-11-13T06:33:34.121916792Z caller=multitsdb.go:697 level=info name=pantheon-db component=receive component=multi-tsdb tenant=random-tenant123 msg="opening TSDB"
ts=2024-11-13T06:33:34.128508813Z caller=multitsdb.go:743 level=info name=pantheon-db component=receive component=multi-tsdb tenant=random-tenant123 msg="TSDB is now ready"
ts=2024-11-13T06:33:50.252309547Z caller=shipper.go:259 level=warn name=pantheon-db component=receive component=multi-tsdb tenant=random-tenant123 msg="reading meta file failed, will override it" err="failed to read /var/thanos/data/random-tenant123/thanos.shipper.json: open /var/thanos/data/random-tenant123/thanos.shipper.json: no such file or directory"
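
When checking many such lines, the tenants whose TSDB reached the "TSDB is now ready" message can be extracted with a small logfmt parser. This is a rough sketch: the function names are hypothetical, and it relies on `shlex` for quoting, so it assumes values contain no unbalanced quote characters.

```python
import shlex
from typing import Dict, Iterable, Set


def parse_logfmt(line: str) -> Dict[str, str]:
    """Parse key=value pairs; quoted values may contain spaces.

    If a key repeats (as "component" does in the lines above), the last value wins.
    """
    fields: Dict[str, str] = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def ready_tenants(lines: Iterable[str]) -> Set[str]:
    """Collect tenants that logged the "TSDB is now ready" message."""
    ready: Set[str] = set()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "TSDB is now ready" and "tenant" in fields:
            ready.add(fields["tenant"])
    return ready


if __name__ == "__main__":
    import sys

    # Usage: pipe receiver logs on stdin.
    print(sorted(ready_tenants(sys.stdin)))
```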

Tested on the latest main; this behavior no longer occurs:

[Two screenshots attached: 2024-11-12, 10:37:20 PM and 10:37:28 PM]