thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Implement multi-tenant Ruler: multitsdb and multiagent #5133

Open saswatamcode opened 2 years ago

saswatamcode commented 2 years ago

Is your proposal related to a problem?

Currently, the Thanos Ruler has no built-in support for multi-tenancy like Receive. This creates issues when running it in a setup where we want to isolate tenants and store their rule-evaluated metrics in a different tsdb instance each. The only possible way might be using a Ruler for each tenant which is simpler but wasteful of resources.

Also, in the case of using Stateless Rulers, it's harder to achieve multi-tenancy, as different tenants might need different configurations while remote writing (write to separate locations with separate HTTP headers like THANOS-TENANT).

For example, consider a Receive with multiple tenants, to which a single Ruler needs to remote-write rule-based metrics for several tenants so that they are stored in each tenant's Receive tsdb. In this case, the Ruler cannot add a separate HTTP header for each tenant, so Receive treats it as a completely new default tenant and a new tsdb gets created.

[Image: Ruler_Multi_Tenant_Problem]

(Note: This is a separate problem from ensuring that Ruler only selects data from one tenant while evaluating rules.)
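To illustrate the fallback described above, here is a minimal sketch of how a receiver might pick the tenant for an incoming write. This is not actual Thanos code: `resolveTenant` is a hypothetical helper, and the default tenant name is a placeholder. Only the `THANOS-TENANT` header name comes from this issue.

```go
package main

import (
	"fmt"
	"net/http"
)

// Assumed values for illustration: the header name quoted in this issue and
// a placeholder name for the default tenant.
const (
	tenantHeader  = "THANOS-TENANT"
	defaultTenant = "default-tenant"
)

// resolveTenant is a hypothetical helper, not actual Thanos code. It returns
// the tenant named in the request headers, or the default tenant when the
// header is absent.
func resolveTenant(h http.Header) string {
	if tenant := h.Get(tenantHeader); tenant != "" {
		return tenant
	}
	return defaultTenant
}

func main() {
	tenantWrite := http.Header{}
	tenantWrite.Set(tenantHeader, "team-a")

	// A single Ruler writing for many tenants sends no per-tenant header.
	rulerWrite := http.Header{}

	fmt.Println("tenant write ->", resolveTenant(tenantWrite))
	fmt.Println("ruler write  ->", resolveTenant(rulerWrite))
}
```

Because the header applies to the whole request, a Ruler that batches series from several tenants into one write has no way to express more than one tenant this way.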

Describe the solution you'd like

A potential solution would be using the Receive multitsdb in Ruler and having the same flags for tenancy as Receive (--receive.default-tenant-id, --receive.tenant-label-name). The Ruler would then be tenant-aware and store evaluated metrics in a different tsdb instance for each tenant, using the tenant_id label to identify which rule-based series belongs to which tenant (assuming that the rule file configuration specifies the tenant label for each rule).

This can be extended to Stateless Ruler by allowing separate remote write configs for each tenant. This would start an agent, i.e., a WAL-only storage, for each tenant, which remote-writes only to the locations configured for that tenant. In essence, a multiagent package would be needed to handle this.

The addition of multitsdb to Ruler could also be skipped, since the Scalable Rule proposal mentions the removal of the embedded tsdb as part of its work plan! :)

[Image: Rules_multiagent]
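To make the "one storage instance per tenant" idea in this proposal more concrete, here is a rough sketch under stated assumptions. The types `multiStore` and `tenantStore` are hypothetical in-memory stand-ins, not the real multitsdb or agent code, and the label and default tenant names are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// Assumed names, for illustration only.
const (
	tenantLabel   = "tenant_id"
	defaultTenant = "default-tenant"
)

// Sample is a simplified rule-evaluation result.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// tenantStore is an in-memory stand-in for one tenant's TSDB or WAL-only agent.
type tenantStore struct {
	samples []Sample
}

// multiStore keeps one store per tenant and creates stores lazily.
type multiStore struct {
	stores map[string]*tenantStore
}

func newMultiStore() *multiStore {
	return &multiStore{stores: make(map[string]*tenantStore)}
}

// Append routes a sample to the store of the tenant named in its tenant
// label, falling back to the default tenant when the label is missing.
func (m *multiStore) Append(s Sample) {
	tenant := s.Labels[tenantLabel]
	if tenant == "" {
		tenant = defaultTenant
	}
	store, ok := m.stores[tenant]
	if !ok {
		store = &tenantStore{}
		m.stores[tenant] = store
	}
	store.samples = append(store.samples, s)
}

// Tenants returns the known tenants in sorted order.
func (m *multiStore) Tenants() []string {
	names := make([]string, 0, len(m.stores))
	for name := range m.stores {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	ms := newMultiStore()
	ms.Append(Sample{Labels: map[string]string{"__name__": "job:up:sum", tenantLabel: "team-a"}, Value: 3})
	ms.Append(Sample{Labels: map[string]string{"__name__": "job:up:sum", tenantLabel: "team-b"}, Value: 5})
	ms.Append(Sample{Labels: map[string]string{"__name__": "job:up:sum"}, Value: 1})

	for _, tenant := range ms.Tenants() {
		fmt.Printf("%s: %d sample(s)\n", tenant, len(ms.stores[tenant].samples))
	}
}
```

In a multiagent variant, each per-tenant store would additionally own its remote write targets. That part is left out here.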

Describe alternatives you've considered

Running a Ruler for each tenant.

Open to feedback and suggestions! If there are existing solutions/configuration options for achieving the same result which will be easier to implement than the above idea, that would be great too! 🙂

matej-g commented 2 years ago

We discussed this briefly with @saswatamcode, and I suggested one more alternative: have a separate remote write config for each tenant, set the tenant header, and use relabeling to forward only the metrics that apply to that tenant. However, this is not really a systematic solution and requires manually setting up the remote write config for every tenant. The proposed solution seems reasonable to me :+1:.
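The alternative above could look roughly like the following sketch. It models the idea only: `tenantWriteConfig`, `outbound`, and `planWrites` are hypothetical names, the label name is assumed, the URLs are placeholders, and the label filter merely imitates what a "keep" relabel rule would do. It does not use real Prometheus or Thanos remote write code.

```go
package main

import (
	"fmt"
	"net/http"
)

const (
	tenantHeader = "THANOS-TENANT" // header name quoted in this issue
	tenantLabel  = "tenant_id"     // assumed tenant label name
)

// Series is a simplified rule-evaluated series.
type Series struct {
	Labels map[string]string
}

// tenantWriteConfig is a hypothetical per-tenant remote write entry.
type tenantWriteConfig struct {
	Tenant string
	URL    string
}

// outbound describes what would be sent to one remote write endpoint.
type outbound struct {
	URL    string
	Header http.Header
	Series []Series
}

// planWrites builds one outbound write per config. Each write carries the
// tenant header and only the series whose tenant label matches, which
// imitates a "keep" relabel rule on that label.
func planWrites(cfgs []tenantWriteConfig, all []Series) []outbound {
	out := make([]outbound, 0, len(cfgs))
	for _, cfg := range cfgs {
		header := http.Header{}
		header.Set(tenantHeader, cfg.Tenant)

		var kept []Series
		for _, s := range all {
			if s.Labels[tenantLabel] == cfg.Tenant {
				kept = append(kept, s)
			}
		}
		out = append(out, outbound{URL: cfg.URL, Header: header, Series: kept})
	}
	return out
}

func main() {
	// Placeholder endpoints; every tenant needs its own hand-written entry.
	cfgs := []tenantWriteConfig{
		{Tenant: "team-a", URL: "http://receive.example/api/v1/receive"},
		{Tenant: "team-b", URL: "http://receive.example/api/v1/receive"},
	}
	all := []Series{
		{Labels: map[string]string{"__name__": "job:up:sum", tenantLabel: "team-a"}},
		{Labels: map[string]string{"__name__": "job:up:sum", tenantLabel: "team-b"}},
	}
	for _, w := range planWrites(cfgs, all) {
		fmt.Printf("%s tenant=%s series=%d\n", w.URL, w.Header.Get(tenantHeader), len(w.Series))
	}
}
```

The `cfgs` slice is where the manual effort shows up: a new tenant means a new entry.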

bwplotka commented 2 years ago

Hey, just trying to understand the main problem we are discussing here.

> The only possible way might be using a Ruler for each tenant which is simpler but wasteful of resources.

Do we have any data on this? Because for stateless rulers there is not much baseline overhead for this situation. I would even say, the more problematic thing is the extreme situation where one tenant has too many rules and alerts for one ruler.

> A potential solution would be using the Receive multitsdb in Ruler and having the same flags for tenancy as Receive

Do you mean sending things to Receive that uses multitsdb or literally using multitsdb code?

> This would start an agent, i.e., a WAL-only storage, for each tenant, which remote-writes only to the locations configured for that tenant. In essence, a multiagent package would be needed to handle this.

I would really avoid doing that. Multi-tsdb is already a tough idea: every new TSDB is costly to start and reload. Not sure if we want to replicate this idea for agent code.

> Also, in the case of using Stateless Rulers, it's harder to achieve multi-tenancy, as different tenants might need different configurations while remote writing (write to separate locations with separate HTTP headers like THANOS-TENANT).

Right. We need essentially something like this:

[Image]

I feel we should have multi-tenant rulers that can handle any number of tenants' rules (tenant agnostic), and build tenancy with label-aware sharding on the receiver. The Receive router already checks EACH series in a write request and distributes it via the hashring, so why not check the tenant label there?
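As a rough sketch of that label-aware idea: the router reads the tenant from each series' label rather than from a request-wide header, then picks a node per series. Everything here is an assumption for illustration. `route`, `pickNode`, and `seriesKey` are hypothetical, the label and default tenant names are placeholders, and plain hash-modulo stands in for whatever hashring algorithm Receive actually uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// Assumed names, for illustration only.
const (
	tenantLabel   = "tenant_id"
	defaultTenant = "default-tenant"
)

// Series is a simplified series from a remote write request.
type Series struct {
	Labels map[string]string
}

// placement records where one series would be sent.
type placement struct {
	Tenant string
	Node   string
}

// seriesKey renders labels in sorted order so the hash is deterministic.
func seriesKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		b.WriteString(name)
		b.WriteByte('=')
		b.WriteString(labels[name])
		b.WriteByte(',')
	}
	return b.String()
}

// pickNode uses plain hash-modulo as a stand-in for a real hashring.
// It assumes nodes is non-empty.
func pickNode(nodes []string, tenant, key string) string {
	h := fnv.New64a()
	h.Write([]byte(tenant))
	h.Write([]byte{0}) // separator between tenant and series key
	h.Write([]byte(key))
	return nodes[h.Sum64()%uint64(len(nodes))]
}

// route derives the tenant of every series from its tenant label and picks
// a node for it, so one request may carry series of many tenants.
func route(nodes []string, batch []Series) []placement {
	out := make([]placement, 0, len(batch))
	for _, s := range batch {
		tenant := s.Labels[tenantLabel]
		if tenant == "" {
			tenant = defaultTenant
		}
		out = append(out, placement{
			Tenant: tenant,
			Node:   pickNode(nodes, tenant, seriesKey(s.Labels)),
		})
	}
	return out
}

func main() {
	nodes := []string{"receive-0", "receive-1", "receive-2"}
	batch := []Series{
		{Labels: map[string]string{"__name__": "job:up:sum", tenantLabel: "team-a"}},
		{Labels: map[string]string{"__name__": "job:up:sum", tenantLabel: "team-b"}},
		{Labels: map[string]string{"__name__": "job:up:sum"}},
	}
	for i, p := range route(nodes, batch) {
		fmt.Printf("series %d -> tenant=%s node=%s\n", i, p.Tenant, p.Node)
	}
}
```

With this shape the Ruler stays tenant agnostic and only needs to make sure each rule result carries the tenant label.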

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

yeya24 commented 2 years ago

Would love to see this moving forward. A general sharder is really something we need in Thanos. Cortex has something similar using the Ring. In Thanos, we have the hashring only on the receiver side. However, if we want to distribute work such as rules, compaction jobs, etc., we don't have a good way to do so now.

saswatamcode commented 2 years ago

Yup! I'm writing a proposal + poc for this currently. Will land soon! 🙂

benjaminhuo commented 2 years ago

> Yup! I'm writing a proposal + poc for this currently. Will land soon! 🙂

Looking forward to this feature!

anarcher commented 11 months ago

How is ruler sharding going? :-) As a Cortex user, I found this feature useful.

benjaminhuo commented 3 months ago

> I feel we should have multi-tenant rulers that can handle any number of tenants' rules (tenant agnostic), and build tenancy with label-aware sharding on the receiver. The Receive router already checks EACH series in a write request and distributes it via the hashring, so why not check the tenant label there?

Does https://github.com/thanos-io/thanos/pull/7256 already implement this feature? @bwplotka @GiedriusS