Yisaer commented 4 years ago

Description

Describe the feature you'd like: TiDB-Operator is going to support an auto-scaling feature for TiKV and TiDB in TidbCluster.

Auto-scaling would let Operator users automatically scale TiKV and TiDB in a TidbCluster out or in, based on metrics, values, or resources that the users provide. This issue is for discussing and tracking the whole process of designing and implementing auto-scaling.

First, I think there are some prerequisites for auto-scaling TidbClusters:

- A customizable auto-scaling algorithm. (must have) As a distributed database, TidbCluster is sensitive to scaling, so we should be able to control the whole auto-scaling process.
- The metrics configuration should be extensible. (must have) Currently, auto-scaling needs TidbCluster metrics to decide the recommended numbers of TiKV and TiDB instances. In the future, platform information (Kubernetes) or external global metrics/values will also be necessary.
- The interval between auto-scaling actions should be controlled at the cluster level. (nice to have) The Operator can manage several clusters in one Kubernetes cluster. We should provide cluster-level control, and the interval is also important for avoiding performance jitter.
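
As a rough illustration of the cluster-level interval control in the last prerequisite, here is a minimal sketch in Python. The `IntervalGate` class, its method names, and the example cluster names are all hypothetical and are not part of tidb-operator. It only shows the idea that each cluster carries its own minimum interval between scaling actions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IntervalGate:
    """Hypothetical helper: one minimum scaling interval per cluster."""

    # Minimum number of seconds between two scaling actions, keyed by cluster name.
    interval_seconds: Dict[str, float]
    # Time of the last scaling action, keyed by cluster name.
    last_scaled: Dict[str, float] = field(default_factory=dict)

    def may_scale(self, cluster: str, now: float) -> bool:
        """Return True if the cluster has never scaled or its interval has elapsed."""
        last = self.last_scaled.get(cluster)
        if last is None:
            return True
        return now - last >= self.interval_seconds[cluster]

    def record(self, cluster: str, now: float) -> None:
        """Remember that the cluster was scaled at time `now`."""
        self.last_scaled[cluster] = now


# Two clusters managed by one Operator, each with its own interval.
gate = IntervalGate({"cluster-a": 300.0, "cluster-b": 60.0})
gate.record("cluster-a", now=0.0)
print(gate.may_scale("cluster-a", now=100.0))  # still inside cluster-a's interval
print(gate.may_scale("cluster-b", now=100.0))  # cluster-b is not affected
```

Keeping the state per cluster means a recent scaling action in one cluster does not block another cluster from scaling.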

Auto-Scaling Design

To support this feature and meet the prerequisites, auto-scaling is designed around one new API (TidbClusterAutoScaler) and one new controller (AutoScaler Controller).

TidbClusterAutoScaler is similar to HPA. Operator users can use it to scale the TidbCluster in or out according to the demands they configure in the TidbClusterAutoScaler spec.

The AutoScaler Controller watches the TidbClusterAutoScaler and reconciles it by adjusting the replicas in the TidbCluster.

TODO List

P0
- [x] Update CRD for tidbautoscaler #3156
- [x] Defaulting and validation for tidbautoscaler CR #3157
- [x] sync tidbautoscaler when spec.xxxx.external is configured #3158
- [ ] Sync tidbautoscaler with PD API #3159

Workload Estimation (P0 features)

45

Time

GanttStart: 2020-07-13 GanttDue: 2020-09-30

Documentations

Project

Yisaer commented 4 years ago

We welcome everyone to help implement the auto-scaling feature in the Operator through discussion, suggestions, code review, pull requests, etc.

Yisaer commented 4 years ago

Currently, auto-scaling is an alpha feature. It only provides the following abilities:

- basic auto-scaling availability
- a basic guarantee to avoid jitter during auto-scaling

To become production ready, we need to:

- control the scaling step for auto-scaling (needs discussion)
- record the consecutive count timestamp in auto-scaling (needs discussion)
- fetch store info from pdapi
- skip the consecutive count control if two successive auto-scaling results are the same (needs discussion)
- add a timeout for Prometheus queries
- make filterTidbInstance compatible with TiDB failover
- reduce noise
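
To make the "scaling step" item more concrete, here is a minimal sketch in Python of one possible way to limit how far a single auto-scaling round may move the replica count. The function name `clamp_step` and the `max_step` parameter are hypothetical. The actual design was still marked as needing discussion, so this is only an illustration of the idea.

```python
def clamp_step(current: int, recommended: int, max_step: int) -> int:
    """Move from `current` toward `recommended` by at most `max_step` replicas.

    Hypothetical helper, not tidb-operator code.
    """
    if max_step < 1:
        raise ValueError("max_step must be at least 1")
    delta = recommended - current
    # Limit the change to the range [-max_step, +max_step].
    delta = max(-max_step, min(max_step, delta))
    return current + delta


# The recommendation jumps from 3 to 10 replicas, but one round adds at most 2.
print(clamp_step(current=3, recommended=10, max_step=2))
```

With a limit like this, a sudden spike in the recommended number is reached over several rounds instead of in one large scaling action.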

Yisaer commented 4 years ago

The following issue describes how the auto-scaling algorithm works based on average CPU load:

#1722
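
The algorithm itself is described in the linked issue and is not reproduced in this thread. Since TidbClusterAutoScaler is described above as similar to HPA, the sketch below uses the standard Kubernetes HPA formula, `desired = ceil(current * currentMetric / targetMetric)`, as a stand-in. It is an assumption for illustration, not the Operator's actual algorithm, and the function name, parameters, and percent units are hypothetical.

```python
import math
from typing import Sequence


def recommended_replicas(
    cpu_percent: Sequence[float],
    target_percent: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """HPA-style recommendation from per-instance CPU utilization (in percent).

    Assumption: this mirrors the Kubernetes HPA formula, not tidb-operator code.
    """
    if not cpu_percent:
        raise ValueError("need at least one instance")
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    # current * average == sum, so ceil(current * avg / target) == ceil(sum / target).
    desired = math.ceil(sum(cpu_percent) / target_percent)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Three TiKV instances at 90% CPU with a 60% target: 270 / 60 = 4.5, rounded up.
print(recommended_replicas([90, 90, 90], target_percent=60, min_replicas=3, max_replicas=10))
```

The real HPA also applies a tolerance around the target before changing replicas. That step is omitted here to keep the sketch short.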

Yisaer commented 4 years ago

After #1731 is merged, we will have an alpha-stage auto-scaling ability based on CPU load. There is still plenty of work to do:

- Add syncing of the TidbClusterAutoScaler status
- Add proper events and logs
- Design noise reduction for the auto-scaling process
- Support a specific PD label for TiKV instances created by auto-scaling out
- Add unit tests and e2e tests for auto-scaling
- Support TidbMonitor

Yisaer commented 4 years ago

There are several good first issues about auto-scaling. Newcomers are welcome to contribute by picking up these tasks.

Ref:

- #1751
- #1752
- #1753

Yisaer commented 4 years ago

Keeping the replicas in sync between the online configuration and the local configuration is always a problem once we use the autoscaler (or HPA). This issue requests a new feature to solve the problem: https://github.com/pingcap/tidb-operator/issues/1818

Yisaer commented 4 years ago

To improve the user experience, we should enrich the information shown when executing:

kubectl get tidbclusterautoscaler

Ref: https://github.com/pingcap/tidb-operator/issues/1820

Yisaer commented 4 years ago

We have released auto-scaling as an alpha feature in Operator 1.1. It is based on CPU load. Next, we will focus on the following three things:

- Support more kinds of metrics in auto-scaling
- Noise reduction for auto-scaling
- For TiKV and TiDB auto-scaling, try to use a heterogeneous design instead of a mutating webhook

The e2e tests should also be completed.

Yisaer commented 4 years ago

We are happy to announce that auto-scaling is going to support external strategies. It will expose an HTTP interface so that community users can apply their own auto-scaling strategy (such as an AI-based predictive strategy) to TidbCluster auto-scaling.

For more detail, see: https://github.com/pingcap/tidb-operator/pull/2279

DanielZhangQD commented 4 years ago

UCP:
- P1:
  - noise reduction https://github.com/pingcap/tidb-operator/pull/2307
  - noise reduction for tidb
  - control the scaling step for auto-scaling https://github.com/pingcap/tidb-operator/issues/2372

Non-UCP:
- The replicas updated by auto-scaler may be overwritten by applying local yaml https://github.com/pingcap/tidb-operator/issues/1818
- fetch store info from pdapi
- make filterTidbInstance compatible with tidb failover
- support more kinds of metrics in auto-scaling
- For tikv and tidb auto-scaling, we would try to use heterogeneous design instead of mutation webhook.

pingcap / tidb-operator

[Feature] AutoScaling For TidbCluster #1651
