kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

ImageLocality plugin should fetch the full list of images available on the nodes to compute an accurate scoring during scheduling #117346

Closed tiraboschi closed 5 months ago

tiraboschi commented 1 year ago

What would you like to be added?

The scheduler's ImageLocality plugin computes a score for each node based on the information in the scheduler cache. This is used to favor nodes that already have the container images required by the Pod being scheduled.
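
For illustration, a minimal sketch of how such a cache-driven score behaves (this is not the actual plugin code, which also scales the sum by how widely each image is spread across nodes and normalizes the result; the image name and sizes below are made up):

```go
// Minimal sketch, not the actual plugin code: the score a node receives is a
// function only of the images the scheduler cache knows about for that node.
package main

import "fmt"

// cachedImages models the scheduler cache for one node: image name -> size in
// bytes, populated from what the node reported in node.Status.Images.
type cachedImages map[string]int64

// localityScore sums the sizes of the requested images that are present in the
// cache; an image missing from the cache contributes nothing, regardless of
// whether it is actually on the node's disk.
func localityScore(requested []string, cache cachedImages) int64 {
	var sum int64
	for _, img := range requested {
		sum += cache[img] // 0 if the image is not in the cache
	}
	return sum
}

func main() {
	requested := []string{"registry.example/virt-launcher:v1"} // hypothetical image

	nodeA := cachedImages{"registry.example/virt-launcher:v1": 500 << 20} // reported
	nodeB := cachedImages{}                                               // not reported

	fmt.Println("node A:", localityScore(requested, nodeA)) // 524288000
	fmt.Println("node B:", localityScore(requested, nodeB)) // 0
}
```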

The point is that the scheduler cache only tracks the images that the node reports in node.Status.Images: https://github.com/kubernetes/kubernetes/blob/c1bbf791dd22e94aaa879f473ec9f10a24da0ec9/pkg/scheduler/internal/cache/cache.go#L684-L689

but the list of images reported in the node status (ordered by descending image size) is capped at nodeStatusMaxImages, a kubelet configuration parameter (default: 50): https://github.com/kubernetes/kubernetes/blob/c1bbf791dd22e94aaa879f473ec9f10a24da0ec9/pkg/kubelet/nodestatus/setters.go#L415-L433

So the ImageLocality plugin ends up making its decision based on a partial view of the images actually present on each node.
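
To make the capping concrete, a small sketch mimicking the behaviour of the linked setter (not the actual kubelet code; names and sizes are invented):

```go
// Sketch of the capping described above, not the actual kubelet code: images
// are sorted by size, descending, and only the first nodeStatusMaxImages
// entries end up in node.Status.Images.
package main

import (
	"fmt"
	"sort"
)

type image struct {
	name      string
	sizeBytes int64
}

// capImageList mimics the effect of nodeStatusMaxImages on the reported list;
// a negative limit is treated as "no cap" in this sketch.
func capImageList(images []image, maxImages int) []image {
	sort.Slice(images, func(i, j int) bool { return images[i].sizeBytes > images[j].sizeBytes })
	if maxImages >= 0 && len(images) > maxImages {
		images = images[:maxImages]
	}
	return images
}

func main() {
	// A node with three images but a cap of two: the smallest image stays on
	// disk yet never shows up in node.Status.Images, so the scheduler cache
	// never learns about it.
	images := []image{
		{"big-analytics:latest", 4 << 30},
		{"medium-db:latest", 2 << 30},
		{"virt-launcher:v1", 500 << 20},
	}
	for _, img := range capImageList(images, 2) {
		fmt.Println(img.name, img.sizeBytes)
	}
}
```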

Why is this needed?

Due to external factors, a node may have pulled larger images in the past that rank higher in the size-ordered list, so a "small" image that is available on many nodes may be reported under node.Status.Images by only a few of them. In that case the ImageLocality plugin favors the nodes that report the image over the nodes that do not report it for those external reasons, and this can lead to an unbalanced schedule (at least to the extent that the ImageLocality plugin's weight is significant compared to the other scoring plugins).

In the KubeVirt use case, for instance, a single image (the virt-launcher one, containing libvirt, qemu, and so on) is used by all VMI pods (what makes them different is the guest OS disk, typically contained in a PV), but not all of the worker nodes involved with KubeVirt consistently report it, and this can lead the ImageLocality plugin to apply a misleading weight.

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

tiraboschi commented 1 year ago

/sig sig-scheduling

k8s-ci-robot commented 1 year ago

@tiraboschi: The label(s) sig/sig-scheduling cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes/kubernetes/issues/117346#issuecomment-1508291973):

> /sig sig-scheduling
tiraboschi commented 1 year ago

/sig scheduling

AxeZhan commented 1 year ago

will #116938 help?

tiraboschi commented 1 year ago

> will #116938 help?

No, I don't think so: with that PR the image list in the scheduler cache would be refreshed earlier/more often, but here the root cause is that the size-ordered list of images reported by each node is capped (at nodeStatusMaxImages, default 50), so "small" images may never be reported in the node status at all. That does not depend on how often the cache is refreshed.

In the end, having only a subset of the nodes report an image that is in fact present on all of them leads the ImageLocality plugin to apply an unfair weight: it prefers the nodes that actually report the image (which is all it can know), and, due to the capping of the list, those are just a subset of the nodes that really have it.

AxeZhan commented 1 year ago

Ah, I see now. I think we need to bring sig-node into this issue, since node.Status.Images is handled by the kubelet.

/sig node

smarterclayton commented 1 year ago

At some level, the scheduler already sees the images that are requested on nodes over time. We capped the number of images that showed up in node status because it was too big and caused a lot of write load on etcd. Instead of adding more images there, we could:

  1. Show different images (the ones not referenced by pods directly) and have the scheduler infer the pod images from its own caches and history of scheduling
  2. Provide a different mechanism for the node to report its images that does not couple to the Node object and scales cleanly with the images

Some of the history of node.Status.Images also depends on layer reuse, and may not apply to all runtimes (WASM, VM) on the node. A more flexible mechanism might also give users room to indicate which nodes have significant cached resources that a workload might want to leverage.
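
A purely illustrative sketch of what option 2 above could look like, assuming a hypothetical per-node object (this is not an existing Kubernetes API; the type and field names are invented):

```go
// Hypothetical types only, not an existing Kubernetes API: a per-node object
// that reports the full image inventory outside of node.Status, so the list
// can grow without inflating Node writes to etcd.
package main

// ImageRecord describes one image present on the node.
type ImageRecord struct {
	Names     []string // tags and digests referring to the same image
	SizeBytes int64
}

// NodeImageInventory would be published by the kubelet (or a node agent) and
// watched by the scheduler instead of the capped node.Status.Images list.
type NodeImageInventory struct {
	NodeName string
	Images   []ImageRecord
}

func main() {}
```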

fabiand commented 1 year ago

WRT option 1: it can surely be optimized to decide which images to show; however, in the end I still believe that we need to cap the number of images.

WRT option 2: regardless of the mechanism, won't we want to keep limiting the number of images in order to protect the scheduler and avoid a DoS?

What about only using the ImageLocality plugin conditionally, as long as we are not at the limit? If we are at the limit, then we simply don't have complete knowledge, and the plugin should not carry any weight in the decision.
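
A hypothetical sketch of that idea (this is not current scheduler behaviour, and it assumes the scheduler somehow knows each node's nodeStatusMaxImages setting, which today it does not):

```go
// Hypothetical guard, not current scheduler behaviour: if a node reports as
// many images as the cap allows, the list was very likely truncated, so treat
// the locality information as incomplete and return a neutral score.
package main

import "fmt"

const nodeStatusMaxImages = 50 // kubelet default for the cap

func guardedImageLocalityScore(reportedImages int, rawScore int64) int64 {
	if reportedImages >= nodeStatusMaxImages {
		return 0 // incomplete knowledge: don't let ImageLocality weigh in
	}
	return rawScore
}

func main() {
	fmt.Println(guardedImageLocalityScore(50, 120)) // 0: list is probably capped
	fmt.Println(guardedImageLocalityScore(12, 120)) // 120: list assumed complete
}
```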

fabiand commented 1 year ago

@smarterclayton can you help to identify the next steps of how we can move this forward?

ffromani commented 1 year ago

Hey, chiming in from the scheduler side. Would it help to:

  1. keep reporting a capped list of images exactly like today (backward compat, cap the load on etcd)
  2. report an additional non-cryptographic hash of the remaining (or of all) images present on the node

The core idea is that a single hash should be a good enough approximation of the set of images on a node. We are doing something similar for NUMA-aware scheduling (which is, however, an out-of-tree component), with good results so far.
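
A minimal sketch of such a hash, assuming FNV-1a over the sorted image names (the choice of hash and separator is arbitrary here, purely for illustration):

```go
// Minimal sketch: a single non-cryptographic hash that approximates the full
// set of images on a node, so two nodes with identical sets compare equal
// without the node having to report every entry.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// imageSetHash hashes the sorted image names so the result is independent of
// the order in which the images were pulled.
func imageSetHash(images []string) uint64 {
	sorted := append([]string(nil), images...)
	sort.Strings(sorted)
	h := fnv.New64a()
	for _, name := range sorted {
		h.Write([]byte(name))
		h.Write([]byte{0}) // separator, so "a"+"bc" != "ab"+"c"
	}
	return h.Sum64()
}

func main() {
	nodeA := []string{"virt-launcher:v1", "big-analytics:latest"}
	nodeB := []string{"big-analytics:latest", "virt-launcher:v1"}
	fmt.Println(imageSetHash(nodeA) == imageSetHash(nodeB)) // true
}
```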

HTH

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

kerthcet commented 6 months ago

From the scheduling point of view, the ImageLocality score is strongly related to image size, so providing the top 50 (by default) images is expected behavior.

> some node could have used in the past larger images that are higher in the list

That seems more related to the node side (something like an LRU policy). For the scheduler, if the image is reported to exist, then we should consider it.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 5 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/kubernetes/issues/117346#issuecomment-2016724398):

> /close not-planned