gocrane / crane

Crane is a FinOps Platform for Cloud Resource Analytics and Economics in Kubernetes clusters. The goal is not only to help users manage cloud costs more easily but also to ensure the quality of applications.
https://gocrane.io
Apache License 2.0
1.87k stars 382 forks

Reduce crane-agent's CPU usage when a node has more than two hundred pods #708

Open wolfleave opened 1 year ago

wolfleave commented 1 year ago

Describe the feature

Currently, when a node has more than two hundred pods, crane-agent uses 2 CPU cores (2C). This is too high for an agent.

image

In this picture, crane-agent-bfcbj uses 2C on a node with 217 pods, crane-agent-cjktf uses 893m on a node with 119 pods, and crane-agent-zcv26 uses 20m on a node with 13 pods.

Expect

crane-agent should use less than 500m when the node has 200 pods.
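
To put the figures above on a common scale, here is a rough sketch that divides each reported CPU figure by the pod count on that node. The helper name `per_pod_millicores` is made up for illustration, and 2C is taken as 2000m; the inputs are only the three data points and the target quoted in this issue.

```python
# (pod count, crane-agent CPU in millicores) as reported in this issue.
# 2C is assumed to mean 2000 millicores.
observations = {
    "crane-agent-bfcbj": (217, 2000),
    "crane-agent-cjktf": (119, 893),
    "crane-agent-zcv26": (13, 20),
}


def per_pod_millicores(pods, cpu_m):
    """Average agent CPU cost per pod on the node, in millicores."""
    return cpu_m / pods


for name, (pods, cpu_m) in observations.items():
    print(f"{name}: {per_pod_millicores(pods, cpu_m):.1f}m per pod")

# The requested target: under 500m on a node with 200 pods.
target_per_pod = per_pod_millicores(200, 500)
print(f"target: {target_per_pod:.1f}m per pod")
```

With only three data points this is an indication rather than a measurement, but it suggests the per-pod cost on the busy nodes is several times the requested target.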

chenkaiyue commented 1 year ago

Thanks for the issue. Is your agent version 0.9.0? We will start fixing this next week.

wolfleave commented 1 year ago

Yes, crane-agent 0.9.0. Looking forward to the fix next week.

chenkaiyue commented 1 year ago

OK

chenkaiyue commented 1 year ago

image

image

Last week we profiled crane-agent and found that most of the CPU time is spent in the advisor, which collects pod-related metrics. We currently collect a very large number of metrics even though only the CPU and memory metrics are used for the watermark, and this collection consumes a lot of CPU. Compared with 0.5.0, the current version collects many more metrics and therefore uses more resources.

We also discussed how to solve this. Next, we plan to make the set of collected metrics configurable, so that the agent does not collect metrics that are never used later in the pipeline. This will reduce metric collection while still meeting user needs.

For now, we recommend collecting metrics less often: the collection interval is currently 5 seconds and can be raised to 60 seconds with the --collect-interval parameter. Also check which features you actually use. If you only use the watermark feature, you can turn off the noderesource manager and podresource manager through the NodeResource and CranePodResource feature gates, both of which are enabled by default.

I tested on a node with 230 pods: after adjusting the interval to 60 seconds, crane-agent used approximately 0.5C.
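
As a back-of-the-envelope illustration of why a longer interval helps, the sketch below counts per-pod collections per minute at the two intervals mentioned above. It assumes, purely for illustration, that each collection round touches every pod on the node exactly once; the function name is hypothetical and is not part of crane-agent.

```python
def collections_per_minute(pods, interval_seconds):
    """Per-pod metric collections per minute.

    Assumes one collection per pod per round (an illustrative
    simplification, not a description of crane-agent internals).
    """
    rounds_per_minute = 60 / interval_seconds
    return pods * rounds_per_minute


before = collections_per_minute(230, 5)   # 5-second interval
after = collections_per_minute(230, 60)   # 60-second interval
ratio = before / after

print(f"5s interval:  {before:.0f} collections/min")
print(f"60s interval: {after:.0f} collections/min")
print(f"reduction:    {ratio:.0f}x fewer collections")
```

Under this assumption the number of collections drops by the ratio of the two intervals. Actual CPU usage need not drop by the same factor.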