pingcap / tiproxy

How to exclude a TiDB from TiProxy? #609

Closed · salarali closed this 1 month ago

salarali commented 1 month ago

I want to bring up TiDB instances that I do not want routed through TiProxy. Can you point me toward a setting that lets me add a TiDB instance without it being added to TiProxy's list of available servers?

djshow832 commented 1 month ago

The only way I can think of is to set the `zone` label on both TiProxy and TiDB and then set the balance policy to `location`.

E.g. The config of TiProxy:

labels={"zone"="cluster1"}
[balance]
policy="location"

The config of TiDB:

labels={"zone"="cluster1"}

The excluded TiDB (or simply leave the label unset):

labels={"zone"="cluster2"}

All the configurations can be set through the HTTP API without restarting TiDB/TiProxy.
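
For example, something like this should work (hostnames are placeholders, and I'm assuming the default ports: 3080 for the TiProxy API and 10080 for the TiDB status port):

```bash
# Update TiProxy's labels and balance policy online (the request body is TOML):
cat <<'EOF' | curl -X PUT --data-binary @- http://tiproxy-host:3080/api/admin/config/
labels = { zone = "cluster1" }

[balance]
policy = "location"
EOF

# Update a TiDB server's labels without restarting it
# (the /labels endpoint may require a recent TiDB version):
curl -X POST -H "Content-Type: application/json" \
  -d '{"zone":"cluster2"}' http://tidb-host:10080/labels
```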

djshow832 commented 1 month ago

Actually, I reserved the config `balance.label` to isolate TiDB instances. It tells TiProxy to route only to the TiDB instances that have the same label. But I didn't document it because I need to rethink the feature.

E.g. The config of TiProxy:

labels={"cluster_name"="cluster1"}
[balance]
label="cluster_name"

The config of TiDB:

labels={"cluster_name"="cluster1"}

The excluded TiDB (or simply leave the label unset):

labels={"cluster_name"="cluster2"}

Can you describe the scenario where you need to isolate TiDB instances, so that I can design the feature properly in the future?

djshow832 commented 1 month ago

I'm really interested in your workload. It needs auto-scaling, load balancing, and resource isolation, all of which are strengths of TiProxy. I'd appreciate it if you could tell me why you need this and what other features you expect.

salarali commented 1 month ago

Our traffic pattern is very cyclic: more requests in the morning, fewer at night. There can also be bursts of requests at certain times of the day, depending on various factors. Turning on auto-scaling gives us a significant cost benefit, saving around 10-20% on compute cost for the TiDB nodes. The other benefit is less management headache: we don't need to over-provision or keep a close eye on the capacity of the TiDB nodes. We have set it to scale up automatically when the group's CPU utilization reaches a certain threshold.

The resource isolation scenario is something we recently found we need. There are two types of workloads right now. One is a very high QPS of writes. The other is an hourly job that reads a bunch of data and stores it somewhere else. For the hourly job, we are unable to reduce the query time; a single query can take 10 minutes or more to finish. So what is happening is that during auto-scaling, we terminate the hourly job's connections, making it fail. The hourly job is also very memory intensive, so most of its queries OOM. What we are thinking is to give the hourly job a dedicated TiDB cluster with different parameters (no auto-scaling, different memory configs).

djshow832 commented 1 month ago

We basically have 2 plans for compute-layer isolation:

1. Deploy a separate TiProxy for each group of TiDB instances, with a load balancer in front of the TiProxy sets.
2. Have a single TiProxy route connections to the right TiDB group, e.g. based on labels or usernames.

That's why I was hesitating. I'm not sure which one is the best way for users. Any insights?

BTW, did you consider using resource control to isolate TiKV resources? Would it make sense to you if TiProxy combined compute-layer isolation with resource control (especially on the SQL side)? E.g. user A belongs to resource group R, and TiProxy should assign 3 TiDB instances to group R.
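
For reference, resource groups are configured with SQL. A minimal sketch, with made-up group and user names:

```bash
# TiDB resource control (v7.0+ syntax); group/user names are hypothetical.
mysql -h tidb-host -P 4000 -u root -p -e "
  -- Cap the hourly job at a fixed number of Request Units per second:
  CREATE RESOURCE GROUP IF NOT EXISTS hourly_batch RU_PER_SEC = 2000;
  -- Bind the job's user to that group, so its queries are throttled at
  -- the storage layer instead of competing with the write workload:
  ALTER USER 'report_user' RESOURCE GROUP hourly_batch;
"
```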

salarali commented 1 month ago

I would prefer a single TiProxy instance that can separately load balance between the two TiDB clusters. This is the approach we will be using for now, except with a single envoy load balancer routing to the two TiDB clusters. The reason is that one load balancer is much easier to manage than two.

Haven't looked too deeply into resource control yet. Maybe that will be beneficial to our use case.

djshow832 commented 1 month ago

In either solution, TiProxy will separately load balance between the two TiDB clusters. If you choose solution 2, maybe you don't need an extra envoy? What routing rule does envoy use? I can only find zone-aware load balancing and original destination service discovery.

salarali commented 1 month ago

Yeah. We were thinking of moving away from TiProxy temporarily and having a single envoy serve the two TiDB clusters. The routing works by just having two clusters and two different routes.
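
Roughly like this minimal sketch (listener ports and upstream addresses are placeholders, not our actual config):

```bash
# One envoy, two L4 routes: each listener TCP-proxies MySQL traffic
# to its own TiDB cluster. Names and addresses are made up.
cat > envoy.yaml <<'EOF'
static_resources:
  listeners:
  - name: oltp                # high-QPS write workload
    address: { socket_address: { address: 0.0.0.0, port_value: 4000 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: oltp
          cluster: tidb_oltp
  - name: batch               # hourly read-heavy job
    address: { socket_address: { address: 0.0.0.0, port_value: 4001 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: batch
          cluster: tidb_batch
  clusters:
  - name: tidb_oltp
    type: STRICT_DNS
    load_assignment:
      cluster_name: tidb_oltp
      endpoints:
      - lb_endpoints:
        - endpoint:
            address: { socket_address: { address: tidb-oltp.internal, port_value: 4000 } }
  - name: tidb_batch
    type: STRICT_DNS
    load_assignment:
      cluster_name: tidb_batch
      endpoints:
      - lb_endpoints:
        - endpoint:
            address: { socket_address: { address: tidb-batch.internal, port_value: 4000 } }
EOF
```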

But since we got this working, we have two sets of TiProxy instances instead.