kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0
4.23k stars · 645 forks

add gomaxprocs limit, return node fit error and pod QoS in advance #1423

Open fanhaouu opened 4 weeks ago

fanhaouu commented 4 weeks ago

This PR aims to address the following three issues:

  1. The default value of GOMAXPROCS is the number of CPU cores on the machine. When GOMAXPROCS exceeds the number of truly usable cores, the Go scheduler keeps switching OS threads, which hurts the descheduler's performance;
  2. Node fit currently runs many filter checks, similar to the filter plugins in the scheduler. The correct logic should be: if any check fails, terminate early. The current logic instead runs every filter check regardless, which is not only time-consuming but also unnecessary. (https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/node/node.go#L107)
  3. If the pod's QoS class is already set, return it early. Pods evicted by the descheduler are typically already scheduled onto nodes, so in most cases the QoS class is already present in the pod's status, and it is unnecessary to recompute it from scratch.
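Point 3 can be sketched as follows. The types here are simplified stand-ins for the real Kubernetes API objects, and `getPodQOS`/`computeQOS` are illustrative names, not the descheduler's actual functions:

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes types involved (illustrative only).
type PodStatus struct{ QOSClass string }
type Pod struct{ Status PodStatus }

// getPodQOS returns the QoS class already recorded on the pod's status when
// present, and only falls back to recomputing it otherwise, as the PR suggests.
func getPodQOS(pod *Pod) string {
	if pod.Status.QOSClass != "" {
		return pod.Status.QOSClass // fast path: the kubelet already set it
	}
	return computeQOS(pod) // slow path: derive from container resources
}

// computeQOS is a placeholder for the full derivation from requests/limits.
func computeQOS(pod *Pod) string { return "BestEffort" }

func main() {
	fmt.Println(getPodQOS(&Pod{Status: PodStatus{QOSClass: "Guaranteed"}})) // Guaranteed
	fmt.Println(getPodQOS(&Pod{}))                                          // BestEffort
}
```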
k8s-ci-robot commented 4 weeks ago

Hi @fanhaouu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
k8s-ci-robot commented 4 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign a7i for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • **[OWNERS](https://github.com/kubernetes-sigs/descheduler/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
fanhaouu commented 3 weeks ago

@a7i, master, can you help me review this PR?

a7i commented 3 weeks ago

Hi @fanhaouu great contribution. Going to copy some of the maintainers for feedback as well: /cc @jklaw90 @ingvagabund @damemi

My feedback:

  • regarding point 1: For GOMAXPROCS, as far as I know, we don't run any goroutines in Descheduler. How does this help?
  • regarding point 2: The idea is to present all NodeFit predicates. The predicate check is not in any particular order, so returning the first one may not present the whole picture to the cluster operator. In a cluster of 35k pods / 15k deployments, a Descheduler run takes a few (single digit) seconds. Given that the shortest frequency can be a minute, I'm not convinced this optimization is worth it. What do you think?
  • regarding point 3: I like it, that's a great change!
  • overall: if you could split this into 3 PRs, I think it would make it easier to provide feedback and request changes.

fanhaouu commented 3 weeks ago

> Hi @fanhaouu great contribution. Going to copy some of the maintainers for feedback as well: /cc @jklaw90 @ingvagabund @damemi
>
> My feedback:
>
>   • regarding point 1: For GOMAXPROCS, as far as I know, we don't run any goroutines in Descheduler. How does this help?
>   • regarding point 2: The idea is to present all NodeFit predicates. The predicate check is not in any particular order, so returning the first one may not present the whole picture to the cluster operator. In a cluster of 35k pods / 15k deployments, a Descheduler run takes a few (single digit) seconds. Given that the shortest frequency can be a minute, I'm not convinced this optimization is worth it. What do you think?
>   • regarding point 3: I like it, that's a great change!
>   • overall: if you could split this into 3 PRs, I think it would make it easier to provide feedback and request changes.


master, thank you for your reply.

point 1: The descheduler does run some goroutines at present, though not very many, so the GOMAXPROCS limit is optional. However, the current Go runtime is not container-aware, so I think it is better to add it; the JVM, for example, has long supported container environments. (Our company's descheduler runs a lot of goroutines, and performance improved significantly after adding this limit, which is why I kept this change in the PR.)

point 2: The default loop interval is 5 minutes, but it is adjustable. The larger the cluster, the longer node fit takes, so a full loop can easily exceed 5 minutes. And what if the user sets the interval to 1 minute? Checking every predicate one by one is very unnecessary and time-consuming; we should behave like the filter plugins in the scheduler and stop at the first failure.
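The scheduler-style early exit described above can be sketched like this. The `filter` type and check bodies are illustrative, not the descheduler's real predicate signatures:

```go
package main

import "fmt"

// A filter returns a reason string when the pod does not fit; empty means it fits.
type filter func(pod, node string) string

// fitsNode runs filters in order and stops at the first failure, mirroring
// the short-circuit behavior of the scheduler's filter plugins.
func fitsNode(pod, node string, filters []filter) (bool, string) {
	for _, f := range filters {
		if reason := f(pod, node); reason != "" {
			return false, reason // short-circuit: skip the remaining checks
		}
	}
	return true, ""
}

func main() {
	filters := []filter{
		func(p, n string) string { return "" }, // e.g. a taint check that passes
		func(p, n string) string { return "insufficient cpu" },
		func(p, n string) string { panic("never reached after a failure") },
	}
	ok, reason := fitsNode("pod-a", "node-1", filters)
	fmt.Println(ok, reason) // false insufficient cpu
}
```

The trade-off a7i raises still applies: the short-circuit reports only the first failing predicate, not the full list of reasons a node was rejected.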

In short, if we can make the program better and faster without affecting any functionality, I think it makes a lot of sense.

We are also developing a descheduler cache mechanism internally. Currently, the descheduler pulls the data needed for filtering at policy runtime, but a good part of that data could be cached in advance, just like the cache in the scheduler. Once it is running stably in our production environment, I will contribute it to the descheduler community, and we can review it together then.

ingvagabund commented 3 weeks ago

As Amir mentioned, would you please break the PR into three separate PRs? Some of the suggested changes deserve a dedicated discussion. Wrt NodeFit, I am in the process of composing a KEP: https://github.com/kubernetes-sigs/descheduler/issues/1421. This sounds like a good use case to include in the proposal.

fanhaouu commented 3 weeks ago

hi, masters, I have split this into 3 PRs; looking forward to your feedback:

  1. https://github.com/kubernetes-sigs/descheduler/pull/1434
  2. https://github.com/kubernetes-sigs/descheduler/pull/1435
  3. https://github.com/kubernetes-sigs/descheduler/pull/1436

/cc @a7i @jklaw90 @ingvagabund @damemi

k8s-ci-robot commented 3 days ago

PR needs rebase.
