The GetRegions interface returns all regions in the cluster in a single response. While our cluster was relatively small and had few regions, this had little impact. Once the number of regions grows large enough, the request significantly affects both memory and CPU on PD. https://github.com/tikv/pd/blob/0934e641129d8bac3c5975372b2ce492b81fd9db/server/api/region.go#L326-L336
Meanwhile, the monitoring for this interface is very incomplete.
For example:
Currently, only one place in the monitoring system records GetRegions, and it records the QPS of completed requests. In other words, when a get-regions request fails, nothing relevant shows up on Grafana.
Since PR https://github.com/tikv/pd/pull/6622, PD prints a log when it receives a GetRegions request (latest master).
For these reasons, if PD's memory usage increases, it is difficult to tell from monitoring whether GetRegions is the cause.
I think we can do a few things to help improve the maintainability of this module:
[ ] Record metrics when this request is received.
[ ] Monitor the memory usage within this request.
[ ] Disable the interface or limit the number of regions it returns, and let clients use scanning instead.
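As a rough illustration of the first item, here is a minimal sketch of a wrapper that counts a request when it arrives and again if it fails, so that failed requests are no longer invisible. It is not PD's actual code: `requestStats`, `instrument`, and `simulate` are hypothetical names, the plain atomic counters stand in for whatever metrics PD would really export, and the URL path is only used to build a test request.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestStats holds hypothetical counters. In PD these would presumably be
// exported metrics rather than plain atomics.
type requestStats struct {
	received atomic.Int64 // incremented as soon as a request arrives
	failed   atomic.Int64 // incremented when the response status is >= 400
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument counts every request on arrival, before any work is done,
// and counts it as failed if the handler answers with an error status.
func instrument(stats *requestStats, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		stats.received.Add(1)
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		if rec.status >= http.StatusBadRequest {
			stats.failed.Add(1)
		}
	})
}

// simulate sends one successful and one failing request through the wrapper
// and returns the resulting counter values.
func simulate() (received, failed int64) {
	stats := &requestStats{}
	calls := 0
	h := instrument(stats, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		calls++
		if calls == 2 {
			http.Error(w, "simulated failure", http.StatusInternalServerError)
			return
		}
		w.Write([]byte("[]"))
	}))
	for i := 0; i < 2; i++ {
		req := httptest.NewRequest(http.MethodGet, "/pd/api/v1/regions", nil)
		h.ServeHTTP(httptest.NewRecorder(), req)
	}
	return stats.received.Load(), stats.failed.Load()
}
```

Counting on arrival rather than on completion means a request that fails or runs out of memory part-way through still leaves a trace that can be correlated with a memory spike.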
Meanwhile, I think all similar interfaces have this issue when the number of regions is huge.