k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Allow setting `priorityClassName` on ServiceLB daemonset. #10033

Open josephshanak opened 2 weeks ago

josephshanak commented 2 weeks ago

**Is your feature request related to a problem? Please describe.**
I would like to set `priorityClassName` on all of the pods in my cluster so I can control the order in which they are preempted. The pods created by the ServiceLB daemonsets do not have a `priorityClassName`, so they receive the default priority of 0, which is lower than the other priority classes I have defined. This means these pods will likely be preempted when the cluster is over-committed.
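For context, this is the pod-spec field I am referring to; the class name below is just a placeholder:

```yaml
# Illustrative snippet; "my-priority-class" is a placeholder for a user-defined PriorityClass.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  priorityClassName: my-priority-class  # resolved to an integer spec.priority at admission
  containers:
    - name: app
      image: nginx
```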

**Describe the solution you'd like**
I would like the ability to set a `priorityClassName` on the pods created by ServiceLB / k3s: https://github.com/k3s-io/k3s/blob/94e29e2ef5d79904f730e2024c8d1682b901b2d5/pkg/cloudprovider/servicelb.go#L481-L512

Perhaps via a command-line option such as `--servicelb-priority-class=my-priority-class`.

**Describe alternatives you've considered**

  1. I could use a PriorityClass with `globalDefault: true` to define a global default. However, this means any pod without a `priorityClassName` would be scheduled with that same priority, which is not ideal because a forgotten `priorityClassName` would then go unnoticed.

  2. I could create priority classes with negative values (per https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass this should be fine) and use the global default only for the ServiceLB pods (see the sketch after this list); however, this is not ideal for the same reason as above.

  3. k3s could create the pods with the `system-cluster-critical` or `system-node-critical` priority classes.

  4. I could disable ServiceLB with `--disable=servicelb` and install another load balancer provider like MetalLB, which seems to support `priorityClassName` (https://github.com/metallb/metallb/issues/995).
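As a sketch of alternatives 1 and 2, a negative-value PriorityClass marked as the global default would look something like this (the name and value are illustrative):

```yaml
# Illustrative PriorityClass combining alternatives 1 and 2 above;
# the name and value are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: fallback-low
value: -1000          # negative values are allowed per the linked documentation
globalDefault: true   # applied to every pod that does not set priorityClassName
description: "Fallback for pods (such as svclb) without an explicit priorityClassName."
```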

ChristianCiach commented 2 weeks ago

You could probably also use a mutating admission controller like Kyverno to modify the pod spec based on custom rules. See: https://kyverno.io/docs/writing-policies/mutate/

This is surely not an attractive option, but it's a possibility nonetheless.
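A rough sketch of what such a policy might look like, matching the svclb pods by name prefix (the `svclb-*` prefix and the target class name are assumptions on my part, not verified against k3s):

```yaml
# Rough sketch of a Kyverno ClusterPolicy; "svclb-*" and
# "my-priority-class" are assumptions, not verified against k3s.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: svclb-priority
spec:
  rules:
    - name: set-svclb-priority-class
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - kube-system
              names:
                - "svclb-*"
      mutate:
        patchStrategicMerge:
          spec:
            priorityClassName: my-priority-class
```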

brandond commented 2 weeks ago

Seems reasonable. See the linked PR.

josephshanak commented 2 weeks ago

PR looks good to me! And an annotation seems much more flexible!
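If the PR follows the existing `svccontroller.k3s.cattle.io/*` annotation convention, per-Service usage would presumably look something like this (the exact annotation key is my guess, not taken from the PR):

```yaml
# Hypothetical usage; the annotation key is a guess based on the existing
# svccontroller.k3s.cattle.io/* annotation convention.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    svccontroller.k3s.cattle.io/priorityclassname: my-priority-class
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
```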

brandond commented 1 week ago

> The pods created by the ServiceLB daemonsets do not have a `priorityClassName`, so they receive the default priority of 0, which is lower than the other priority classes I have defined.

I will note that the svclb pods have no requests or reservations and consume essentially no resources, since all they do is go to sleep after adding iptables rules.

```
root@k3s-server-1:~# kubectl top pod -n kube-system
NAME                                      CPU(cores)   MEMORY(bytes)
coredns-6799fbcd5-zxktb                   2m           13Mi
local-path-provisioner-6c86858495-dpfb6   1m           6Mi
metrics-server-54fd9b65b-9xqxs            5m           21Mi
svclb-traefik-49baafe9-xwvrd              0m           0Mi
traefik-7d5f6474df-hfhwd                  1m           26Mi
```

> This means these pods will likely be preempted when the cluster is over-committed.

Are you actually seeing the svclb pods get preempted, or is this a theoretical problem?

josephshanak commented 1 week ago

> Are you actually seeing the svclb pods get preempted, or is this a theoretical problem?

This is theoretical. I have not experienced this. I came upon this while attempting to assign priority classes to all pods.
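For reference, assigned priorities can be audited with a plain kubectl query (standard `custom-columns` output, no assumptions beyond the cluster itself):

```
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.priorityClassName,PRIORITY:.spec.priority
```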