tikv / pd

Placement driver for TiKV
Apache License 2.0
1.04k stars 717 forks source link

Increase the probability of balancing new regions #2524

Open nolouch opened 4 years ago

nolouch commented 4 years ago

Feature Request

Describe your feature request related problem

There is an issue about the new region may have a low probability to be balanced,such as:

                             8
                         7   7
                     6   6   6  
                 5   5   5   5
             4   4   4   4   4
         3   3   3   3   3   3
     2   2   2   2   2   2   2 
 1   1   1   1   1   1   1   1 
t1  t2  t3  t4  t5  t6  t7  t8 

such as the above drawing, the y-axis represents the existing regions at that moment. region 8 only appears at time t8, the probability of R8 being selected in the entire sample space is very small.
this issue will lead to a scenario: import a new table into a big cluster, the region of this table will be unevenly distributed, which will bring some hot store issues.

Describe alternatives you've considered

Empirically, We hope that the distribution of data is relatively random.

Teachability, Documentation, Adoption, Migration Strategy

lhy1024 commented 4 years ago

The first proposal:

We put new regions with TTL to the fresh set and pick them with more priority. Its disadvantage is that it is difficult for us to give the probability between the fresh set and the other one.

lhy1024 commented 4 years ago

The second proposal:

Suppose that we expect all regions have the same probability of being scheduled in the past x period. And that the total number of regions at the current time is m.

When placing a new region with a TTL, which will live x period, the probability of being scheduled ( TTL + 1) /m

Its disadvantage is that when the difference between x and m is too large, there is not a big difference in their probability

Of course, we can choose non-linear reduction for TTL weight, such as (F(TTL)+1)/m

dimstars commented 4 years ago

I tested a solution: The total online time of the system t1 minus the survival time of each region t2 is used to represent the scheduling weight w, and the sum of all region weights w is sumt. Each region is selected with a probability of w/sumt.

The figure below shows the comparison of the probability between the three modifications and the original scheme after 1000 cycles (a new region will be generated in each cycle). The strategies they choose for each cycle are as follows: Original scheme: all regions are in no particular order with equal probability. w: The w/sumt scheme above. w^2: In order to further reflect the urgency of choosing a new region, w^2 is used instead of w for calculation. w^10: Calculate w^10 instead of w. image

It can be seen that the "w^10" scheme can greatly reduce the probability gap of being selected in most regions.

dimstars commented 4 years ago

Consider the following:

  1. It should be that the probability of all regions being selected is basically consistent within a certain period of time x. It is not appropriate to use the total online time of the system t as a reference, because old regions are difficult to be selected as time goes by.
  2. Regions may be scheduled multiple times in a short time, so the amount of probability increase needs to be weighed.
  3. The selected region may not be successfully dispatched. How to deal with this situation?
rleungx commented 4 years ago
  1. It should be that the probability of all regions being selected is basically consistent within a certain period of time x. It is not appropriate to use the total online time of the system t as a reference, because old regions are difficult to be selected as time goes by.

I think for a newly created region, we can increase the probability of being selected for that region through a kind of mechanism like TTL. But for the regions which have already existed in the system, we can still use the original way. There is no need to use the time-related way as a reference for all regions.

  1. Regions may be scheduled multiple times in a short time, so the amount of probability increase needs to be weighed.

I am not very clear about what you mean here. Would you like to explain more?

  1. The selected region may not be successfully dispatched. How to deal with this situation?

Maybe one possible way is to reset the TTL? @lhy1024

BTW, would you like to join our slack channel?

dimstars commented 4 years ago

I think for a newly created region, we can increase the probability of being selected for that region through a kind of mechanism like TTL. But for the regions which have already existed in the system, we can still use the original way. There is no need to use the time-related way as a reference for all regions.

Looks good. Maybe this is easier to implement. Calculating probabilities based on time is really hard to control. Because there will be many uncertain factors, such as scheduling failure and the irregular generation of new regions.

I am not very clear about what you mean here. Would you like to explain more?

This is the question we raised in our discussion. If the probability of a new region is excessively increased, the new region may be selected multiple times, leading to frequent scheduling. But if you don't use the "calculating probabilities" approach, there should be some mechanisms to avoid this.

BTW, would you like to join our slack channel?

Sure.

rleungx commented 4 years ago

This is the question we raised in our discussion. If the probability of a new region is excessively increased, the new region may be selected multiple times, leading to frequent scheduling. But if you don't use the "probability" approach, there should be some mechanisms to avoid this.

The reason why this problem existed is that we want to avoid the hot spot problem. Just as @nolouch said, "Import a new table into a big cluster, the region of this table will be unevenly distributed, which will bring some hot store issues", once we have selected a new region to balance, maybe that region can be removed from the new region set.

lhy1024 commented 4 years ago

Maybe one possible way is to reset the TTL?

@rleungx What does it mean, can you explain it?

lhy1024 commented 4 years ago

But for the regions which have already existed in the system, we can still use the original way.

I think it may be a good idea!

rleungx commented 4 years ago

Maybe one possible way is to reset the TTL?

@rleungx What does it mean, can you explain it?

If we have selected a region and create an operator for it, but the operator is timeout due to some reason. Do we need to reset/refresh the TTL?

dimstars commented 4 years ago

If we put a new region with TTL into a new region set and select them first. For the regions that already exist, the selection still follows the original method. When the new region is successfully dispatched, it is removed from the collection.

For different generation rates of regions and the probability of preferential selection of new regions, the following situations may occur:

  1. Region is generated rapidly, and the new region set is getting larger and larger. Periodically removing older regions from the new set solves this problem. (TTL is needed here)
  2. New Regions remained at a stable number. This is the ideal state.
  3. The collection of new regions is very small. Whenever a region is generated, it will be selected almost immediately. I'm not sure that would have a bad effect.

The disadvantage is that it's very difficult to give a probability between a new set and another set. Maybe we can experiment to get the right value, or maybe it's not a fixed value.

lhy1024 commented 3 years ago

We may need to consider adding the configuration, effect and cluster size of the strategy and some other information to the telemetry