tikv / pd

Placement driver for TiKV
Apache License 2.0

Tracking issue for self-adaptive rate limit in self protection #7167

Open CabinfeverB opened 1 year ago

CabinfeverB commented 1 year ago

Development Task

Summary: Refer to https://github.com/tikv/pd/issues/4373. We have implemented gRPC and HTTP rate limiting, both configured by manually setting the rate-limit parameters. However, it is hard to choose a suitable limit, because the capacity of different clusters differs, and the capacity of a given interface also varies with the load.

So we should provide a mechanism that sets the rate limit adaptively.

Here are some references for applying the TCP BBR algorithm to system traffic-limiting scenarios:

  - https://github.com/alibaba/Sentinel/wiki/%E7%B3%BB%E7%BB%9F%E8%87%AA%E9%80%82%E5%BA%94%E9%99%90%E6%B5%81
  - https://github.com/go-kratos/aegis/blob/main/ratelimit/bbr/bbr.go

Goal

  1. Adaptive rate limiting for a single API: for a hot-path API, when its processing speed reaches the bottleneck, apply rate limiting to prevent the OOM or CPU overload caused by accumulating requests.
  2. Adaptive service degradation: sort out API priorities; when a high-priority API reaches its bottleneck, throttle low-priority APIs with tighter rate-limit configurations to improve the overall availability of PD.
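The second goal could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Priority` and `apiLimit` types, the `degrade` function, the tightening factor, and the priority assigned to each API name are all hypothetical and do not reflect PD's actual configuration.

```go
package main

// Priority is a hypothetical two-level API priority.
type Priority int

const (
	High Priority = iota
	Low
)

// apiLimit is a hypothetical per-API rate-limit entry.
type apiLimit struct {
	name     string
	priority Priority
	qps      float64
}

// degrade returns a copy of the limits. When a high-priority API is
// saturated, every low-priority limit is multiplied by factor (0 <= factor < 1),
// leaving more capacity for high-priority traffic. The input is not modified.
func degrade(apis []apiLimit, highSaturated bool, factor float64) []apiLimit {
	out := make([]apiLimit, len(apis))
	copy(out, apis)
	if !highSaturated {
		return out
	}
	for i := range out {
		if out[i].priority == Low {
			out[i].qps *= factor
		}
	}
	return out
}
```

Calling `degrade(limits, true, 0.5)` would halve the QPS of every low-priority entry and leave high-priority entries unchanged; with `highSaturated` set to `false` the limits are returned as they were.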

Tasks:

misc

yzhan1 commented 2 weeks ago

@CabinfeverB Does this issue include adding rate-limit to TCP connection creations?

We've seen an issue where the follower kept creating short-lived HTTPS connections to the leader, so the leader spent 60%+ of its CPU on SSL handshakes and was not able to serve TSO and other requests.

While that particular issue has been fixed by using keep-alive connections from follower to leader, this kind of connection storm can still happen if, say, one TiDB/TiKV host goes bad and keeps creating new connections, and this could still bring down PD or significantly increase its latency.

CabinfeverB commented 1 week ago

Cc @nolouch

nolouch commented 1 week ago

It is not included; this issue only covers rate limiting for requests. cc @rleungx @niedhui