kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Supporting extended resources in Kueue and DWS #2867

Open romilbhardwaj opened 3 weeks ago

romilbhardwaj commented 3 weeks ago

We are using Kueue with DWS on a GKE cluster to manage GPU instances. Our application relies on access to /dev/fuse, which is exposed through a daemonset that adds an extended resource, smarter-devices/fuse, to all nodes in the cluster.

If I try to submit a pod which requests the following resources (YAML):

    resources:
      requests:
        nvidia.com/gpu: 1
        smarter-devices/fuse: 1
      limits:
        nvidia.com/gpu: 1
        smarter-devices/fuse: 1

The ProvisioningRequest fails with "Provisioning Request's pods cannot be scheduled in the nodepool, affected nodepools: pool-1". This is presumably because the smarter-devices/fuse resource is not available in the node pool.

Instead of failing, I would like Kueue/DWS to provision the node and submit the pod anyway, since once the node is spun up I expect my daemonset to take care of creating the smarter-devices/fuse resource.

Is it possible to have Kueue/DWS "ignore" certain extended resources in the pod spec?

More logs:

Versions:

Kueue: v0.8

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.11-gke.1172000

alculquicondor commented 3 weeks ago

Would you still want to define quotas for smarter-devices/fuse in Kueue?

romilbhardwaj commented 3 weeks ago

No, we don't need quotas for smarter-devices/fuse. Here's an example ClusterQueue I would like to use:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "dws-cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "smarter-devices/fuse"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10000  # Infinite quota.
      - name: "memory"
        nominalQuota: 10000Gi # Infinite quota.
      - name: "nvidia.com/gpu"
        nominalQuota: 10  # Limited quota.
      - name: "smarter-devices/fuse"
        nominalQuota: 10000  # Infinite quota.
  admissionChecks:
  - dws-prov

alculquicondor commented 3 weeks ago

We do have a field in the Kueue Configuration called excludeResourcePrefixes (https://kueue.sigs.k8s.io/docs/reference/kueue-config.v1beta1/#Resources), but it currently only excludes them from quota calculations.
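For context, a minimal sketch of how that field might be set in the Kueue Configuration. The prefix value "smarter-devices/" is an assumption chosen to match the resource in this issue, and, as noted above, this setting currently affects quota calculations only, not ProvisioningRequest creation.

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
resources:
  # Resources whose names start with one of these prefixes are left out of
  # quota calculations. The prefix below is illustrative (an assumption),
  # picked to cover smarter-devices/fuse.
  excludeResourcePrefixes:
  - "smarter-devices/"
```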

We could potentially reuse that field to also exclude them from the ProvisioningRequest creation. Or make it an additional option. But I lean towards not adding more configuration, to keep the API simple.

As a workaround, given that what you ask is not currently supported, you could always have a webhook that drops the resource from PodTemplates.

colinjc commented 3 weeks ago

Took the webhook approach and got this working with a small Kyverno policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: mutate-dws-pod-template
  annotations:
    policies.kyverno.io/title: Remove incompatible resources from PodTemplates
    policies.kyverno.io/subject: PodTemplate
    policies.kyverno.io/description: >-
      Removes unsupported resource requests from PodTemplate manifests to allow submission to DWS queue.
spec:
  mutateExistingOnPolicyUpdate: false
  background: false
  failurePolicy: Ignore
  rules:
  - name: mutate-remove-unsupported-resources
    match:
      resources:
        kinds:
          - PodTemplate
        namespaceSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - kueue-jobs
    mutate:
      foreach:
        - list: "request.object.template.spec.containers"
          patchesJson6902: |-
            - path: /template/spec/containers/{{elementIndex}}/resources/requests/smarter-devices~1fuse
              op: remove
            - path: /template/spec/containers/{{elementIndex}}/resources/limits/smarter-devices~1fuse
              op: remove

romilbhardwaj commented 3 weeks ago

Thanks @alculquicondor. The webhook approach would work (like @colinjc's Kyverno policy), but it would be nice if Kueue could also exclude resources matching excludeResourcePrefixes from the PodTemplate used in the ProvisioningRequest.

alculquicondor commented 3 weeks ago

/reopen

Yes, we'll add some configuration somewhere

alculquicondor commented 3 weeks ago

/assign @PBundyra

alculquicondor commented 3 weeks ago

I think we should just remove the excluded resources from the Workload objects altogether, making sure that equivalency checks still hold

alculquicondor commented 3 weeks ago

@colinjc btw, don't forget to add the same rule for request.object.template.spec.initContainers, if you have those
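A sketch of what that extra entry might look like, mirroring the containers entry in the policy above. This is untested and would sit alongside the existing entry under mutate.foreach; whether Kyverno handles PodTemplates that have no initContainers gracefully with this list expression is an assumption worth verifying.

```yaml
        # Additional entry under mutate.foreach (sketch, not from the original policy).
        # Same patches as the containers entry, pointed at initContainers instead.
        - list: "request.object.template.spec.initContainers"
          patchesJson6902: |-
            - path: /template/spec/initContainers/{{elementIndex}}/resources/requests/smarter-devices~1fuse
              op: remove
            - path: /template/spec/initContainers/{{elementIndex}}/resources/limits/smarter-devices~1fuse
              op: remove
```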