pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

[Feature] AutoScaling For TidbCluster #1651

Open Yisaer opened 4 years ago

Yisaer commented 4 years ago

Description

Describe the feature you'd like: TiDB Operator is going to support an auto-scaling feature for TiKV and TiDB in a TidbCluster.

Auto-scaling would let Operator users automatically scale TiKV and TiDB in a TidbCluster out or in, based on metrics, values, or resources that the users provide. This issue is for discussing and tracking the whole process of designing and implementing auto-scaling.

First, I think there are some prerequisites for auto-scaling TidbClusters:

Auto-Scaling Design

To support this feature and meet the prerequisites, the design adds one new API (TidbClusterAutoScaler) and one new controller (the AutoScaler Controller).

TidbClusterAutoScaler is similar to the Kubernetes Horizontal Pod Autoscaler (HPA). Operator users could use it to scale a TidbCluster in or out according to the demands they configure in the TidbClusterAutoScaler spec.

The AutoScaler Controller would watch the TidbClusterAutoScaler and reconcile it to adjust the replicas in the TidbCluster.
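
As a rough illustration of the reconcile step described above, the sketch below clamps a recommended replica count to bounds taken from the autoscaler spec before it would be written back to the TidbCluster. The `AutoScalerSpec` class, its `min_replicas` / `max_replicas` fields, and `reconcile_replicas` are hypothetical names chosen for this example; they are not taken from the Operator's API.

```python
from dataclasses import dataclass


@dataclass
class AutoScalerSpec:
    """Hypothetical subset of a TidbClusterAutoScaler spec for one component."""
    min_replicas: int
    max_replicas: int


def reconcile_replicas(recommended: int, spec: AutoScalerSpec) -> int:
    """Clamp a recommended replica count to [min_replicas, max_replicas]."""
    if spec.min_replicas > spec.max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return max(spec.min_replicas, min(recommended, spec.max_replicas))


# Example: a recommendation above the upper bound is capped at max_replicas.
spec = AutoScalerSpec(min_replicas=3, max_replicas=6)
print(reconcile_replicas(8, spec))
```

Keeping the bounds in the spec means a noisy or faulty recommendation can never push the cluster outside the range the user declared.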

Category

Auto-Scaling

TODO List

Workload Estimation (P0 features)

45

Time

GanttStart: 2020-07-13 GanttDue: 2020-09-30

Documentation

Project

Yisaer commented 4 years ago

We welcome everyone to help implement the auto-scaling feature in the Operator through discussion, suggestions, code review, pull requests, and so on.

Yisaer commented 4 years ago

Currently, auto-scaling is an alpha feature. It provides only the following:

  1. basic auto-scaling availability
  2. a basic guarantee to avoid jitter during auto-scaling

To become production ready:

  1. control the scaling step for auto-scaling (needs discussion)
  2. record the consecutive count timestamp in auto-scaling (needs discussion)
  3. fetch store info from pdapi
  4. skip the consecutive count control if two successive auto-scaling results are the same (needs discussion)
  5. add a timeout for Prometheus queries
  6. make filterTidbInstance compatible with TiDB failover
  7. noise reduction
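
Two of the ideas in the list above, the consecutive count control and the scaling step limit, can be sketched as follows. This is only a minimal sketch under assumptions: `ConsecutiveGate`, `limit_step`, and their parameters are hypothetical names for this example, and the thread does not specify how the Operator implements either mechanism.

```python
class ConsecutiveGate:
    """Accept a recommendation only after it repeats `threshold` times in a row."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self._last = None   # most recent recommendation seen
        self._count = 0     # how many times in a row it has been seen

    def observe(self, recommended: int) -> bool:
        """Record one recommendation; return True when it should be acted on."""
        if recommended == self._last:
            self._count += 1
        else:
            self._last = recommended
            self._count = 1
        return self._count >= self.threshold


def limit_step(current: int, recommended: int, max_step: int) -> int:
    """Move from `current` toward `recommended` by at most `max_step` replicas."""
    delta = recommended - current
    delta = max(-max_step, min(delta, max_step))
    return current + delta


gate = ConsecutiveGate(threshold=3)
decisions = [gate.observe(r) for r in [5, 5, 4, 4, 4]]
print(decisions)
print(limit_step(current=3, recommended=8, max_step=2))
```

The gate suppresses jitter from a recommendation that flips between values, while the step limit bounds how far a single reconcile pass can move the replica count.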

Yisaer commented 4 years ago

The following issue describes how the auto-scaling algorithm works based on average CPU load.

#1722
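
The algorithm itself is described in the linked issue and is not reproduced in this thread. For orientation only, the sketch below shows the ratio rule used by the Kubernetes Horizontal Pod Autoscaler, which an average-CPU-load autoscaler might resemble. Treating the Operator's algorithm as HPA-like is an assumption, and `desired_replicas` and its parameters are hypothetical names.

```python
import math


def desired_replicas(current_replicas: int,
                     avg_cpu_percent: float,
                     target_cpu_percent: float) -> int:
    """HPA-style rule: ceil(current * average_load / target_load).

    Assumption: this mirrors the Kubernetes HPA formula, not necessarily
    the algorithm the Operator uses.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    return math.ceil(current_replicas * avg_cpu_percent / target_cpu_percent)


# Three instances averaging 90% CPU against a 60% target: 3 * 90 / 60 = 4.5, rounded up.
print(desired_replicas(3, 90, 60))
```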

Yisaer commented 4 years ago

After #1731 is merged, we will have an alpha-stage auto-scaling ability based on CPU load. There is still plenty of work to do:

  1. Sync the TidbClusterAutoScaler status
  2. Add proper events and logs
  3. Design noise reduction for the auto-scaling process
  4. Support a specific PD label for the TiKV instances added by auto-scaling out
  5. Add unit tests and e2e tests for auto-scaling
  6. Support TidbMonitor

Yisaer commented 4 years ago

There are several good first issues about auto-scaling. Newcomers are welcome to contribute by picking up these tasks.

Ref:

#1751

#1752

#1753

Yisaer commented 4 years ago

Keeping the replica count in sync between the online configuration and the local configuration is always a problem once an autoscaler (or HPA) is in use. This issue requests a new feature to solve the problem: https://github.com/pingcap/tidb-operator/issues/1818

Yisaer commented 4 years ago

To improve the user experience, we should enrich the information shown when executing:

kubectl get tidbclusterautoscaler

Ref: https://github.com/pingcap/tidb-operator/issues/1820

Yisaer commented 4 years ago

Auto-scaling has been released as an alpha feature in Operator 1.1, based on CPU load. Next, we will focus on the following three things:

  1. Support more kinds of metrics in auto-scaling.
  2. Noise reduction for auto-scaling.
  3. For TiKV and TiDB auto-scaling, try a heterogeneous design instead of a mutating webhook.

The e2e test should also be completed.

Yisaer commented 4 years ago

We are happy to announce that auto-scaling is going to gain an external strategy ability that exposes an HTTP interface, so that community users can use their own auto-scaling strategy (such as an AI-based predictive strategy) to drive TidbCluster auto-scaling.

For more detail, see: https://github.com/pingcap/tidb-operator/pull/2279
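
To give a feel for what a user-supplied strategy behind an HTTP interface could look like, here is a minimal sketch built on Python's standard library. Everything about the wire format is an assumption made for illustration: the JSON fields `currentReplicas`, `cpuPercent`, and `recommendedReplicas`, the 80 percent threshold, and the idea that the strategy runs as a small HTTP service are all hypothetical. The actual interface is defined in the pull request linked above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def recommend(request: dict) -> dict:
    """Toy strategy: add one replica when reported CPU load exceeds 80 percent."""
    current = int(request["currentReplicas"])   # hypothetical field name
    cpu = float(request["cpuPercent"])          # hypothetical field name
    return {"recommendedReplicas": current + 1 if cpu > 80 else current}


class StrategyHandler(BaseHTTPRequestHandler):
    """Answer POST requests with the strategy's recommendation as JSON."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length))
            status, payload = 200, recommend(body)
        except (ValueError, KeyError, TypeError) as exc:
            # Malformed JSON or missing fields: report a client error.
            status, payload = 400, {"error": str(exc)}
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Run the strategy service until interrupted (not called automatically)."""
    HTTPServer(("", port), StrategyHandler).serve_forever()
```

Calling `serve()` starts the service. Keeping the decision logic in the pure function `recommend` makes a custom strategy easy to unit test without any networking.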

DanielZhangQD commented 4 years ago