tikv / pd

Placement driver for TiKV
Apache License 2.0

Add a special scheduler for evict leader with timeout #2782

Open nolouch opened 4 years ago

nolouch commented 4 years ago

Feature Request

Describe your feature request related problem

When upgrading a TiKV cluster, we use evict-leader-scheduler to ensure the restarting TiKV instance has no leaders. However, we have repeatedly hit the problem that the evict-leader-scheduler is not deleted during the rolling upgrade process. To better solve this, we can provide a special evict-leader scheduler with a timeout for the deployment tool.

Describe the feature you'd like

Add a special scheduler for evict leader with timeout
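
A minimal sketch of the requested behaviour, assuming a PD-side expiry check; `removeScheduler`, `expireEvictLeaders`, and the deadline map are illustrative names, not PD's real internals:

```go
package sketch

import "time"

// removeScheduler stands in for PD's real "remove scheduler" logic.
func removeScheduler(storeID uint64) { _ = storeID }

// expireEvictLeaders is assumed to run periodically inside PD: any
// evict-leader scheduler whose deadline has passed is removed, so a
// crashed deployment tool cannot leave a stale scheduler behind.
func expireEvictLeaders(deadlines map[uint64]time.Time) {
	now := time.Now()
	for storeID, deadline := range deadlines {
		if now.After(deadline) {
			removeScheduler(storeID)
			delete(deadlines, storeID)
		}
	}
}
```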

Describe alternatives you've considered

The timeout should be suitable in most cases.

Teachability, Documentation, Adoption, Migration Strategy

BusyJay commented 4 years ago

Timeout is unpredictable. No one knows what an appropriate value is. I suggest using a streaming gRPC call or a long-lived HTTP connection. The evict-leader-scheduler is added once the call/connection is established, and removed once the call/connection is aborted or finished.
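
A minimal server-side sketch of that idea, with hypothetical types standing in for a generated gRPC service (PD's real protos would differ): the scheduler exists exactly as long as the client keeps the stream open.

```go
package sketch

import "context"

// Hypothetical request/stream types; not PD's real gRPC definitions.
type EvictLeaderRequest struct{ StoreId uint64 }

type EvictLeaderStream interface{ Context() context.Context }

type server struct{}

func (s *server) addEvictLeaderScheduler(storeID uint64) error { return nil } // stub
func (s *server) removeEvictLeaderScheduler(storeID uint64)    {}             // stub

// EvictLeader ties the scheduler's lifetime to the call: it is added when
// the stream is established and removed when the handler returns, whether
// the client finished normally or the connection was aborted.
func (s *server) EvictLeader(req *EvictLeaderRequest, stream EvictLeaderStream) error {
	if err := s.addEvictLeaderScheduler(req.StoreId); err != nil {
		return err
	}
	defer s.removeEvictLeaderScheduler(req.StoreId)

	// Block until the stream's context is cancelled (client done or gone).
	<-stream.Context().Done()
	return nil
}
```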

disksing commented 4 years ago

I'd like to propose another approach. Instead of depending on a special scheduler, we can introduce a new store state; let's just call it UNLOAD. With this state, suppose we want to upgrade a TiKV node (a rough sketch of the transitions follows the steps):

  1. We use pd-ctl or the PD API to set the node's state to UNLOAD.
  2. PD creates operators to transfer all leaders out of the store (just like evict-leader, but without creating the scheduler).
  3. After the leader count becomes 0, PD changes the node's state to UNLOADED.
  4. The user restarts / upgrades the TiKV node.
  5. When PD receives a putStore command from the TiKV node, it updates the node's state to Up.
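
A rough sketch of those transitions, with illustrative names (UNLOAD / UNLOADED are proposed in this comment and do not exist in PD today):

```go
package sketch

// Hypothetical store states for the proposal.
type StoreState int

const (
	Up StoreState = iota
	Unload   // step 1: PD was asked to drain leaders from the store
	Unloaded // step 3: leader count reached 0; safe to restart the node
)

// onSchedule models steps 2-3: while a store is in Unload, PD keeps
// transferring leaders away; once none remain, it flips to Unloaded.
func onSchedule(state StoreState, leaderCount int) StoreState {
	if state == Unload && leaderCount == 0 {
		return Unloaded
	}
	return state
}

// onPutStore models step 5: the restarted TiKV re-registers with PD.
func onPutStore(state StoreState) StoreState {
	if state == Unloaded {
		return Up
	}
	return state
}
```
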
BusyJay commented 4 years ago

@disksing I suggested a similar solution in chat. However, this doesn't handle the case where the user aborts the operation, in which case TiKV doesn't have to be restarted.

The solution I suggested above doesn't require a new scheduler and also works in all known cases.

3pointer commented 4 years ago

BR also meets the same problem. BR temporarily removes balance-region-scheduler, balance-leader-scheduler, ... to speed up restoration and finally adds these schedulers back. But if BR is killed during restoration, these schedulers would be lost, so BR needs PD to provide the ability to temporarily remove schedulers, such as a remove-scheduler TTL option.
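
One way the PD side of that could look, as a rough sketch with illustrative names only: instead of deleting a scheduler outright, PD records when it should come back, so a killed BR process cannot permanently lose it.

```go
package sketch

import (
	"sync"
	"time"
)

// tempRemovals tracks schedulers removed "temporarily" with a TTL.
type tempRemovals struct {
	mu      sync.Mutex
	expires map[string]time.Time // scheduler name -> when to restore it
}

// RemoveWithTTL disables a scheduler and records its restore deadline.
func (t *tempRemovals) RemoveWithTTL(name string, ttl time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.expires[name] = time.Now().Add(ttl)
	// ... disable the scheduler here ...
}

// restoreExpired is assumed to run periodically inside PD: any scheduler
// whose TTL has lapsed is re-enabled even if BR never came back.
func (t *tempRemovals) restoreExpired() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for name, deadline := range t.expires {
		if time.Now().After(deadline) {
			// ... re-enable the scheduler here ...
			delete(t.expires, name)
		}
	}
}
```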

kennytm commented 4 years ago

Currently PD supports the following TTL-based API:

This issue requests a TTL-based API to:

BR requests 3 TTL-based APIs to:

Question: what to do when multiple services require conflicting settings? In GC-TTL the conflict resolution is simple: just set the safepoint to min of all alive services. But for the new APIs... say, service A registers to remove scheduler X, and service B registers to add the same scheduler X, how should this be resolved?

I see two solutions for now:

  1. Select only some specific schedulers and configs, with a clear direction of resolution, e.g. evict-leader-scheduler can only be registered to be added, not removed; the balance-* schedulers can only be removed, not added; max-merge-region-size can only be decreased, not increased, etc. (see the sketch after this list).

  2. First-come-first-served: while one service's TTL for a particular scheduler/config is alive, no other service can register a TTL for the same scheduler/config.
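
For a config that does have a clear direction, option 1 would resolve conflicts the same way the GC safepoint does, by taking the most conservative value among alive registrations; a tiny illustrative sketch (the function and parameter names are not PD's):

```go
package sketch

// effectiveMaxMergeRegionSize resolves a "can only be decreased" config:
// like the GC safepoint (min of all alive services), the effective value
// is the smallest of the permanent setting and every live TTL request.
func effectiveMaxMergeRegionSize(permanent uint64, liveTTLRequests []uint64) uint64 {
	v := permanent
	for _, r := range liveTTLRequests {
		if r < v {
			v = r
		}
	}
	return v
}
```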

We also need to consider the interaction with the existing dynamic (permanent) changes. For instance, if a service has registered to set max-snapshot-count to 40, what effect do we get if we run

3pointer commented 4 years ago

I think the first-come-first-served solution is better, for two reasons:

  1. Not all configs have a clear direction, e.g. {leader,region}-schedule-limit.
  2. The TTL logic is simple, and we can make the TTL based on the service rather than the config: if service A has registered with PD within the TTL, then PD will deny all other services' requests (see the sketch below).
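
A sketch of that per-service lease, with illustrative names: while one service's lease is alive, every other service's registration is rejected.

```go
package sketch

import (
	"errors"
	"sync"
	"time"
)

// serviceLease grants scheduling changes to one service at a time.
type serviceLease struct {
	mu      sync.Mutex
	holder  string
	expires time.Time
}

var errLeaseHeld = errors.New("another service holds the scheduling lease")

// Acquire grants or renews the lease for a service; if a different
// service's lease is still alive, the request is denied.
func (l *serviceLease) Acquire(service string, ttl time.Duration) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if l.holder != "" && l.holder != service && now.Before(l.expires) {
		return errLeaseHeld
	}
	l.holder = service
	l.expires = now.Add(ttl)
	return nil
}
```
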
kennytm commented 4 years ago

For removing schedulers we could use the "Pause" API (#1831), which is available on 3.1 and 4.0.