CabinfeverB opened 1 year ago
@CabinfeverB Does this issue include adding rate limiting for TCP connection creation?
We've seen an issue where the follower kept creating short-lived HTTPS connections to the leader, so the leader spent 60%+ of its CPU on SSL handshakes and was not able to serve TSO and other requests.
While that particular issue has been fixed by using keep-alive connections from the follower to the leader, a connection storm can still happen if, say, one TiDB/TiKV host goes bad and keeps creating new connections, and this could still bring down PD or significantly increase latency.
Cc @nolouch
It is not included; this issue only covers rate limiting at the request level. cc @rleungx @niedhui
Development Task
Summary: Refer to https://github.com/tikv/pd/issues/4373. We have implemented gRPC rate limiting and HTTP rate limiting with manually configured rate-limiting parameters. However, it is hard to pick a good limit, because the carrying capacity of different clusters differs, and the capacity of each interface also varies under different loads.
So we should provide a mechanism to set the rate limit adaptively.
Here are some references on applying the TCP BBR algorithm to system-level traffic-limiting scenarios:
- Sentinel's system adaptive protection (系统自适应限流) wiki page: https://github.com/alibaba/Sentinel/wiki/%E7%B3%BB%E7%BB%9F%E8%87%AA%E9%80%82%E5%BA%94%E9%99%90%E6%B5%81
- The go-kratos aegis BBR limiter: https://github.com/go-kratos/aegis/blob/main/ratelimit/bbr/bbr.go
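For context, the go-kratos aegis limiter linked above works roughly as follows: a sliding window records how many requests passed per bucket and their response times, the admissible in-flight count is estimated as maxPass * minRT (the analogue of BBR's bandwidth-delay product), and requests are rejected only while CPU usage is above a threshold. Below is a minimal Go sketch of that idea; the `adaptive` package name, the `BBRLimiter` type, and the pluggable `cpuLoad` probe are illustrative placeholders, not PD's existing rate-limit code.

```go
package adaptive

import (
	"errors"
	"sync"
	"sync/atomic"
	"time"
)

// ErrLimitExceeded is returned when the adaptive limiter rejects a request.
var ErrLimitExceeded = errors.New("adaptive rate limit exceeded")

// bucket holds per-interval statistics used to estimate throughput and latency.
type bucket struct {
	passed  int64         // requests completed in this interval
	totalRT time.Duration // sum of response times in this interval
}

// BBRLimiter is a simplified BBR-style adaptive limiter: it estimates the
// sustainable in-flight request count from the best observed throughput and
// the lowest observed response time, and only drops requests while the CPU
// usage is above a threshold.
type BBRLimiter struct {
	mu       sync.Mutex
	buckets  []bucket // circular sliding window of statistics
	cursor   int
	interval time.Duration
	lastRoll time.Time

	inFlight     int64        // currently executing requests
	cpuThreshold int64        // e.g. 800 means 80.0% CPU
	cpuLoad      func() int64 // pluggable CPU usage probe, returns 0-1000
}

// New creates a limiter whose sliding window has windowSize buckets of width interval.
func New(windowSize int, interval time.Duration, cpuThreshold int64, cpuLoad func() int64) *BBRLimiter {
	return &BBRLimiter{
		buckets:      make([]bucket, windowSize),
		interval:     interval,
		lastRoll:     time.Now(),
		cpuThreshold: cpuThreshold,
		cpuLoad:      cpuLoad,
	}
}

// maxInFlight estimates the admissible concurrency: the highest per-bucket pass
// count times the number of buckets one request occupies at the lowest observed RT.
func (l *BBRLimiter) maxInFlight() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	var maxPass int64 = 1
	var minRT time.Duration
	for _, b := range l.buckets {
		if b.passed > maxPass {
			maxPass = b.passed
		}
		if b.passed > 0 {
			if avg := b.totalRT / time.Duration(b.passed); minRT == 0 || avg < minRT {
				minRT = avg
			}
		}
	}
	if minRT == 0 {
		minRT = l.interval // no samples yet
	}
	rtBuckets := int64((minRT + l.interval - 1) / l.interval) // ceil(minRT / interval)
	return maxPass * rtBuckets
}

// Allow admits a request unless the CPU is overloaded and the in-flight count
// already exceeds the estimated capacity. Callers must invoke the returned
// done function when the request finishes so its response time is recorded.
func (l *BBRLimiter) Allow() (done func(), err error) {
	if l.cpuLoad() >= l.cpuThreshold && atomic.LoadInt64(&l.inFlight) >= l.maxInFlight() {
		return nil, ErrLimitExceeded
	}
	atomic.AddInt64(&l.inFlight, 1)
	start := time.Now()
	return func() {
		atomic.AddInt64(&l.inFlight, -1)
		l.record(time.Since(start))
	}, nil
}

// record rolls the sliding window forward if needed and adds one sample.
func (l *BBRLimiter) record(rt time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for time.Since(l.lastRoll) >= l.interval {
		l.cursor = (l.cursor + 1) % len(l.buckets)
		l.buckets[l.cursor] = bucket{}
		l.lastRoll = l.lastRoll.Add(l.interval)
	}
	l.buckets[l.cursor].passed++
	l.buckets[l.cursor].totalRT += rt
}
```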
Goal
Tasks:
- [ ] Implement the BBR algorithm for a single API (see the per-API wiring sketch after this list)
- [ ] Adaptive service degradation
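For the first task, the per-API wiring could look roughly like this (continuing the sketch above; `Guard`, `apiLimiters`, the API names, and the window/threshold values are hypothetical placeholders, and a rejected request would be translated to gRPC `ResourceExhausted` or HTTP 429 at the call site):

```go
package adaptive

import (
	"context"
	"time"
)

// sampleCPULoad is a stand-in probe; a real deployment would read CPU usage
// from the OS or PD's metrics instead of returning a constant.
func sampleCPULoad() int64 { return 0 }

// apiLimiters holds one adaptive limiter per API so that throttling a hot
// endpoint does not spill over to the others. Values are placeholders.
var apiLimiters = map[string]*BBRLimiter{
	"GetRegion":      New(10, 100*time.Millisecond, 800, sampleCPULoad),
	"StoreHeartbeat": New(10, 100*time.Millisecond, 800, sampleCPULoad),
}

// Guard wraps a handler with the adaptive limiter registered for its API, if any.
func Guard(ctx context.Context, api string, serve func(context.Context) error) error {
	l, ok := apiLimiters[api]
	if !ok {
		return serve(ctx)
	}
	done, err := l.Allow()
	if err != nil {
		return err // translate to gRPC ResourceExhausted / HTTP 429 at the call site
	}
	defer done()
	return serve(ctx)
}
```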