kubernetes-sigs / metrics-server

Scalable and efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines.
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
Apache License 2.0

Consider rewriting scraper goroutine scheduling #778

Open serathius opened 3 years ago

serathius commented 3 years ago

The Metrics Server scraper is responsible for parallelizing metric collection. Every cycle it lists all the nodes in the cluster and creates a goroutine to scrape each node. Each goroutine waits a random amount of time (to avoid scraping all nodes at the same moment), collects metrics, and then exits. Problems with this approach:

What would you like to be added:

Instead of listing nodes every cycle and churning goroutines, we should maintain a goroutine per node and use an informer event handler to add/remove goroutines as nodes are added or removed. A similar approach is used by Prometheus Server.
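For illustration, a minimal sketch of that pattern could look like the following. This is hypothetical code, not metrics-server's actual scraper: the `nodeScrapers` type and its methods are made up for the example, and the kubelet scrape itself is stubbed out with a print. The informer calls (`NewSharedInformerFactory`, `AddEventHandler`, `cache.ResourceEventHandlerFuncs`) are standard client-go APIs.

```go
// Hypothetical sketch: one long-lived scrape goroutine per node,
// started/stopped from node informer events instead of being respawned
// every scrape cycle.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

type nodeScrapers struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc // node name -> stop function
	period  time.Duration
}

func (s *nodeScrapers) add(nodeName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.cancels[nodeName]; ok {
		return // already scraping this node
	}
	ctx, cancel := context.WithCancel(context.Background())
	s.cancels[nodeName] = cancel
	go func() {
		// Random initial offset so all nodes are not scraped at once.
		jitter := time.Duration(rand.Int63n(int64(s.period)))
		select {
		case <-time.After(jitter):
		case <-ctx.Done():
			return
		}
		ticker := time.NewTicker(s.period)
		defer ticker.Stop()
		for {
			fmt.Printf("scraping node %s\n", nodeName) // placeholder for the kubelet scrape
			select {
			case <-ticker.C:
			case <-ctx.Done():
				return
			}
		}
	}()
}

func (s *nodeScrapers) remove(nodeName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cancel, ok := s.cancels[nodeName]; ok {
		cancel()
		delete(s.cancels, nodeName)
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	scrapers := &nodeScrapers{cancels: map[string]context.CancelFunc{}, period: 15 * time.Second}

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			scrapers.add(obj.(*corev1.Node).Name)
		},
		DeleteFunc: func(obj interface{}) {
			// Note: this sketch ignores DeletedFinalStateUnknown tombstones.
			if node, ok := obj.(*corev1.Node); ok {
				scrapers.remove(node.Name)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run forever
}
```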

Things to consider:

Why is this needed:

I have listed some problems, but we should test this properly before committing to it. We should create benchmarks and validate scraper and storage performance before merging any code; in the worst case we could end up with more complicated code that doesn't address any of the problems.

/cc @yangjunmyfm192085 @dgrisonnet

/kind feature

yangjunmyfm192085 commented 3 years ago

Sounds good. We really need to consider and test this properly before committing to it. When do we plan to start? After https://github.com/kubernetes-sigs/metrics-server/pull/777 is merged?

serathius commented 3 years ago

I don't think this is necessarily blocked by #777 as they are pretty independent changes. I think we can start doing performance testing.

yangjunmyfm192085 commented 3 years ago

Ok, let me take a look and try to do it.

dgrisonnet commented 3 years ago

> How the storage would consume partial data, decide when to remove old data, and what performance overhead that would cause

I don't think this is possible with the current storage implementation. To throw out a suggestion to work around that, maybe we could have one store per node. I don't really see any drawbacks to that, except that getting PodMetrics will become more complex.
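To make that trade-off concrete, here is a hypothetical sketch of a per-node store. The types (`shardedStore`, `nodeStore`, `PodMetrics`) are illustrative only, not metrics-server's actual storage API: writes stay local to one node's store, while reads have to fan out across every store.

```go
// Hypothetical "one store per node" sketch: each node's scrape goroutine
// writes only to its own store, so partial data from other nodes is untouched.
package storage

import "sync"

type PodMetrics struct {
	Namespace, Name string
	CPUMilli        int64
	MemoryBytes     int64
}

type nodeStore struct {
	mu   sync.RWMutex
	pods []PodMetrics
}

type shardedStore struct {
	mu     sync.RWMutex
	byNode map[string]*nodeStore
}

func newShardedStore() *shardedStore {
	return &shardedStore{byNode: map[string]*nodeStore{}}
}

// Store replaces the metrics for a single node; only that node's scraper
// calls this, so a slow or failed node never blocks the others.
func (s *shardedStore) Store(node string, pods []PodMetrics) {
	s.mu.Lock()
	ns, ok := s.byNode[node]
	if !ok {
		ns = &nodeStore{}
		s.byNode[node] = ns
	}
	s.mu.Unlock()
	ns.mu.Lock()
	ns.pods = pods
	ns.mu.Unlock()
}

// GetPodMetrics shows where the extra complexity appears: a read now has to
// fan out over every node's store instead of a single flat map.
func (s *shardedStore) GetPodMetrics() []PodMetrics {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var out []PodMetrics
	for _, ns := range s.byNode {
		ns.mu.RLock()
		out = append(out, ns.pods...)
		ns.mu.RUnlock()
	}
	return out
}
```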

> How to test code and prevent leaking goroutines

We could check for leaks with https://github.com/uber-go/goleak. Also, I don't know if metrics-server registers the client_golang Go collector, but it could give us insight into the number of goroutines currently running via the go_goroutines metric.
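For example, a test along these lines would catch leaked scraper goroutines with goleak. The test names and package are hypothetical; `goleak.VerifyTestMain` and `goleak.VerifyNone` are the library's documented entry points.

```go
// Hypothetical test sketch using uber-go/goleak to detect leaked goroutines.
package scraper

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain fails the package's tests if any goroutine is still running
// after all tests in the package have finished.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Alternatively, check a single test in isolation.
func TestScraperStopsAllGoroutines(t *testing.T) {
	defer goleak.VerifyNone(t)
	// Start and stop the scraper here; any goroutine it leaves behind
	// will cause VerifyNone to fail the test.
}
```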

yangjunmyfm192085 commented 3 years ago

/assign

yangjunmyfm192085 commented 3 years ago

I have been working on this issue recently, and I have a few questions.

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

dgrisonnet commented 3 years ago

/remove-lifecycle rotten
/lifecycle frozen

serathius commented 2 years ago

This should implement the "Redesign goroutine management to reduce scheduling cost (will require redesigning storage)" point from https://github.com/kubernetes-sigs/metrics-server/issues/857.

jasonliu747 commented 2 years ago

/remove-lifecycle frozen

@yangjunmyfm192085 Hi, thanks for your contribution, any update on this issue?

serathius commented 2 years ago

/lifecycle frozen
/cc @shuaich

yangjunmyfm192085 commented 2 years ago

Sorry, I delayed this issue due to lack of time and other things. I have made the changes at https://github.com/yangjunmyfm192085/metrics-server/tree/rewriting-scraper but have not submitted a PR yet. I will submit the PR as soon as possible, and I would need help reviewing whether the modification is reasonable. If there is no problem, I will do a comparison test @serathius. Of course, @shuaich is also welcome to participate.