pingcap / tidb


affinity based range placement #19805

Open tirsen opened 3 years ago

tirsen commented 3 years ago

Feature Request

Is your feature request related to a problem? Please describe:

For any given query or transaction, there are benefits to decreasing the number of TiKV nodes it interacts with. Reducing cross-node bandwidth improves performance, but the primary benefits are better availability and lower tail latency.

For that reason it would be good if TiDB relocated ranges to TiKV nodes based on how ranges are accessed.

Example: With TiDB 5.0 we will be able to use composite keys and clustered indexes to group related tables together with the following schema:

CREATE TABLE customers (
 id BIGINT AUTO_RANDOM,
 version BIGINT,
 name VARCHAR(255),
 PRIMARY KEY (id)
);

CREATE TABLE orders (
 id BIGINT AUTO_INCREMENT,
 customer_id BIGINT,
 -- other columns
 PRIMARY KEY (customer_id, id),
 FOREIGN KEY (customer_id) REFERENCES customers(id)
);

In our scenario, we would have transactions that operate on both order rows and their parent customer row, as well as joins between the order rows and their parent customer row.

These tables would be stored in separate ranges because TiDB does not support interleaved tables, but ideally those ranges would be placed on the same TiKV node. This would decrease the number of 2PC transactions and potentially improve join performance if parts of the join operation can be pushed down into TiKV. (This assumes we can push down operations that work on multiple ranges inside the same TiKV node, which isn't currently the case, but it is a potential future optimization.) It would also improve availability and tail latency by decreasing the number of TiKV nodes accessed by queries and transactions.

(This example is simplified; in reality there would be many "child tables", with transactions and queries spanning many of them.)

Describe the feature you'd like:

The ranges in the TiDB cluster are all nodes in a weighted graph. The weight on an edge represents the "affinity" between two ranges: the higher the affinity, the more often the two ranges are used in the same query or transaction.

We also implement a "placement optimizer" which gradually shifts ranges around to maximize the total weight of affinity edges between ranges stored on the same TiKV node.
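
To make that concrete, here is a rough Go sketch of the affinity bookkeeping and the score such an optimizer could use when deciding whether moving a range to another node is worth it. All names here (RangeID, StoreID, Graph, MoveGain) are invented for the sketch and do not correspond to anything that exists in PD today.

package affinity

// RangeID and StoreID are hypothetical identifiers for this sketch; a real
// implementation would use PD's region and store IDs.
type RangeID uint64
type StoreID uint64

// Graph keeps the weighted affinity edges between ranges.
// edges[a][b] records how strongly ranges a and b are used together
// in the same query or transaction (the "affinity").
type Graph struct {
	edges     map[RangeID]map[RangeID]float64
	placement map[RangeID]StoreID // current placement of each range
}

func NewGraph() *Graph {
	return &Graph{
		edges:     make(map[RangeID]map[RangeID]float64),
		placement: make(map[RangeID]StoreID),
	}
}

// Observe records that one query or transaction touched all of the given
// ranges together, bumping the affinity of every pair by weight.
func (g *Graph) Observe(ranges []RangeID, weight float64) {
	for i := 0; i < len(ranges); i++ {
		for j := i + 1; j < len(ranges); j++ {
			g.addEdge(ranges[i], ranges[j], weight)
			g.addEdge(ranges[j], ranges[i], weight)
		}
	}
}

func (g *Graph) addEdge(a, b RangeID, w float64) {
	m, ok := g.edges[a]
	if !ok {
		m = make(map[RangeID]float64)
		g.edges[a] = m
	}
	m[b] += w
}

// MoveGain estimates how much co-located affinity is gained by moving
// range r to store dst: the affinity to ranges already on dst, minus the
// affinity to ranges on its current store that would be given up.
// The optimizer would repeatedly apply the move with the largest positive
// gain, subject to the usual balance constraints.
func (g *Graph) MoveGain(r RangeID, dst StoreID) float64 {
	src := g.placement[r]
	if src == dst {
		return 0
	}
	gain := 0.0
	for peer, w := range g.edges[r] {
		switch g.placement[peer] {
		case dst:
			gain += w
		case src:
			gain -= w
		}
	}
	return gain
}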

We would need to engineer this carefully so that it doesn't conflict with the other range balancing processes running inside a TiDB cluster. In particular, it may be beneficial to keep high-affinity ranges together even if the cluster isn't entirely evenly balanced. We call this a "bieber shard" :-)

Describe alternatives you've considered:

This is an alternative to interleaved tables (https://github.com/pingcap/tidb/issues/19404), but it seems easier to implement and maintain long term: it would not need to change the underlying key-value layout of tables and it can reuse the existing rebalancing machinery.

Another option is to manually relocate ranges to TiKV nodes using placement rules, but that sounds like a huge burden on the end user/operator.

Teachability, Documentation, Adoption, Migration Strategy:

This is an optimization that improves availability and performance and reduces tail latency, all without being visible to the end user.

gregwebs commented 3 years ago

With respect to 2PC transactions, note that in the region-based system any transaction that spans two regions will require 2PC regardless of whether both regions are on the same TiKV node. Staying on the same node does, however, make a lot more optimizations possible.

I think static placements such as interleaved tables are usually easier to implement reliably and easier to reason about. Of course, they have the downside of not being smart enough. Dynamic movement of regions, on the other hand, has the downside of creating overhead in the cluster.

zz-jason commented 3 years ago

With the weighted graph, we can convert this problem into a classic graph clustering problem.
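
For illustration only (not a concrete proposal), the simplest version of that reduction is to union ranges whose affinity clears some threshold into co-location groups. A Go sketch, with invented names and plain uint64 range IDs:

package affinity

// clusterByAffinity groups ranges into co-location candidates by merging
// every pair whose affinity weight is at or above the threshold, using a
// tiny union-find. This is deliberately naive: a real scheduler would also
// cap group sizes so a group still fits on one store, and weigh the result
// against the normal balance constraints.
func clusterByAffinity(edges map[uint64]map[uint64]float64, threshold float64) map[uint64]uint64 {
	parent := make(map[uint64]uint64)

	var find func(uint64) uint64
	find = func(x uint64) uint64 {
		p, ok := parent[x]
		if !ok {
			parent[x] = x
			return x
		}
		if p != x {
			parent[x] = find(p) // path compression
		}
		return parent[x]
	}
	union := func(a, b uint64) {
		ra, rb := find(a), find(b)
		if ra != rb {
			parent[ra] = rb
		}
	}

	for a, peers := range edges {
		for b, w := range peers {
			if w >= threshold {
				union(a, b)
			}
		}
	}
	// Flatten so every range points directly at its group representative.
	for r := range parent {
		parent[r] = find(r)
	}
	return parent
}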

zz-jason commented 3 years ago

Considering that we have lots of ranges (for example, 2-3 million), we need to carefully design both the method to maintain the weighted graph and the algorithm to efficiently identify all the related ranges.
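
One possible (purely hypothetical) way to keep that maintenance cheap is to cap the number of edges tracked per range and evict the weakest edge on overflow, e.g.:

package affinity

import "math"

// cappedEdges keeps at most maxPeers affinity edges per range, so the whole
// graph stays proportional to (number of ranges) x maxPeers even with 2-3
// million ranges. maxPeers and the eviction rule are illustrative knobs,
// not anything that exists in PD today.
type cappedEdges struct {
	maxPeers int
	peers    map[uint64]map[uint64]float64 // range -> peer range -> weight
}

func newCappedEdges(maxPeers int) *cappedEdges {
	return &cappedEdges{
		maxPeers: maxPeers,
		peers:    make(map[uint64]map[uint64]float64),
	}
}

// add bumps the affinity between r and peer; if r already tracks maxPeers
// edges, the currently weakest edge is evicted, so frequently reinforced
// edges survive while weak one-off edges fall away.
func (c *cappedEdges) add(r, peer uint64, w float64) {
	m, ok := c.peers[r]
	if !ok {
		m = make(map[uint64]float64)
		c.peers[r] = m
	}
	m[peer] += w
	if len(m) <= c.maxPeers {
		return
	}
	var weakest uint64
	min := math.MaxFloat64
	for p, pw := range m {
		if pw < min {
			min, weakest = pw, p
		}
	}
	delete(m, weakest)
}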

bezmax commented 3 years ago

Some things to think about:

  1. Range splitting. Should affinity be pessimistically zeroed out on a split? Or optimistically weighted-split across the sub-ranges? Probably zeroing out is good enough, as it should be repopulated fast enough.

  2. Prioritizing rebalancing of ranges. On every update of affinities, some sort of "placement inefficiency coefficient" can be calculated, which determines how far from its ideal placement a range is. For example, if the affinity between A and B is 10 and the affinity between A and C is 15, but A is currently co-located with B, then the inefficiency coefficient is 5. The inefficiently placed ranges could then be put into a priority queue ordered by that coefficient, to be picked up by some worker (see the sketch after this list). This way you can control how much reshuffling is happening at the same time, while prioritizing the most important ranges.

  3. How to calculate the affinity? A simple count of "queries hitting this edge" might not be enough. By number of rows returned? By total size of rows returned? By "time spent querying this edge"? Some mix of those?

  4. Tackling common usage patterns. A common case is that right after persisting an entity a lot of related inserts/updates/queries are done, but later on queries are much rarer. This could skew the affinity and force the range to be moved back and forth. I can think of a few ways to fix this. One is to make affinity expire or decay over time. Another is a constant threshold below which affinity is ignored; until the threshold is reached, it has no effect on placement. Alternatively, the priority-queue approach from point 2 might make this problem obsolete.
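
To make points 2 and 4 concrete, here is a rough Go sketch with invented names (inefficiency, decay, rebalanceQueue — nothing from an existing PD API): compute an inefficiency coefficient per range, periodically decay the affinity weights, and keep candidate ranges in a max-heap so a worker fixes the worst-placed ranges first:

package affinity

import "container/heap"

// inefficiency is the "placement inefficiency coefficient" from point 2:
// the strongest affinity a range has to any single store, minus the
// affinity it has to the store it currently lives on. 0 means the range
// is already on its best store.
func inefficiency(affinityByStore map[uint64]float64, currentStore uint64) float64 {
	best := 0.0
	for _, w := range affinityByStore {
		if w > best {
			best = w
		}
	}
	return best - affinityByStore[currentStore]
}

// decay ages all affinity weights (point 4), so a burst of activity right
// after an entity is created does not pin its placement forever. factor
// would be something like 0.9 applied every interval.
func decay(edges map[uint64]map[uint64]float64, factor float64) {
	for _, peers := range edges {
		for p, w := range peers {
			peers[p] = w * factor
		}
	}
}

// rebalanceQueue is a max-heap of ranges ordered by inefficiency, so a
// background worker can fix the worst-placed ranges first while the total
// amount of concurrent reshuffling stays bounded.
type rangeItem struct {
	rangeID      uint64
	inefficiency float64
}

type rebalanceQueue []rangeItem

func (q rebalanceQueue) Len() int            { return len(q) }
func (q rebalanceQueue) Less(i, j int) bool  { return q[i].inefficiency > q[j].inefficiency }
func (q rebalanceQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *rebalanceQueue) Push(x interface{}) { *q = append(*q, x.(rangeItem)) }
func (q *rebalanceQueue) Pop() interface{} {
	old := *q
	item := old[len(old)-1]
	*q = old[:len(old)-1]
	return item
}

var _ heap.Interface = (*rebalanceQueue)(nil)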

lcabancla commented 3 years ago

Regarding affinity, another possibility is manually assigned weights. For example, what if we want a particular query to have more importance than another (a pretty common use case, I assume)? The most frequent or longest-running queries are not necessarily the most important. Or, rather than doing it per query, what if we want table A to have more affinity to table B than to other tables? It looks to me like a big part of the challenge is finding which heuristics work and give the best bang for the buck.

tirsen commented 3 years ago

A simpler version of this might be just a static placement "hint" rather than a clever dynamic "affinity".

Something like:

CREATE TABLE orders (
 id BIGINT AUTO_INCREMENT,
 customer_id BIGINT,
 -- other columns
 PRIMARY KEY (customer_id, id),
 FOREIGN KEY (customer_id) REFERENCES customers(id),
 PLACEMENT HINT WITH ROW (customer_id) REFERENCES customers(id)
);

In this case TiDB would try to place the ranges containing orders on the same node as the majority of the customers those orders reference. (So it's not a placement "policy", just a "hint".)
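
As a sketch of how such a hint might be resolved (the KeyToStore lookup below is made up; a real implementation would use PD's routing information): for each orders range, look up which store holds each referenced customer row and pick the majority store as the placement target:

package placementhint

// KeyToStore maps a key to the store currently holding that key's region
// leader. This lookup is an assumption for the sketch, not an existing API.
type KeyToStore func(key []byte) uint64

// majorityStore returns the store that holds the largest share of the
// referenced parent rows, e.g. the customers rows referenced by the
// customer_id values found in one orders range. That store is where the
// hint would suggest placing the orders range.
func majorityStore(parentKeys [][]byte, lookup KeyToStore) (store uint64, ok bool) {
	counts := make(map[uint64]int)
	for _, k := range parentKeys {
		counts[lookup(k)]++
	}
	best := 0
	for s, n := range counts {
		if n > best {
			store, best = s, n
		}
	}
	return store, best > 0
}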

zz-jason commented 3 years ago

The static placement "hint" approach looks good to me. Maybe we can try to implement it first.

Yisaer commented 3 years ago

This idea is great. To achieve it, I think the scheduler (PD) needs to be aware of the affinity between the given data. If that is supported, I think many workloads would benefit from it (e.g. sysbench_update_index).

zz-jason commented 3 years ago

The performance would be even better if this optimization were combined with https://github.com/pingcap/tidb/issues/19381.