kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

DRA: Extra unexpected devices allocated when using 'allocationMode: All' #127554

Open klueska opened 1 day ago

klueska commented 1 day ago

What happened?

I was testing the ability to allocate both node-local resources (GPUs) and a new network-attached resource called an IMEX channel.

My setup is as follows:

With this setup, I create two ResourceClaims, one ResourceClaimTemplate, and two Deployments. The two ResourceClaims are for two distinct IMEX channels, the ResourceClaimTemplate is for all GPUs on a given node, and the two Deployments (of 4 replicas each) consume these claims/claim templates across their replicas to simulate running two MPI jobs across two different IMEX domains.

Here are the specs:

---
apiVersion: v1
kind: Namespace
metadata:
  name: imex-test1
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  namespace: imex-test1
  name: shared-imex-channel0
spec:
  devices:
    requests:
    - name: channel
      deviceClassName: imex.nvidia.com
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  namespace: imex-test1
  name: shared-imex-channel1
spec:
  devices:
    requests:
    - name: channel
      deviceClassName: imex.nvidia.com
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: imex-test1
  name: all-node-gpus
spec:
  spec:
    devices:
      requests:
      - name: all-gpus
        deviceClassName: gpu.nvidia.com
        allocationMode: All
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: imex-test1
  name: pod0
  labels:
    app: imex-test1-pod0
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pod0
  template:
    metadata:
      labels:
        app: pod0
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: gpus
          - name: imex-channel
      resourceClaims:
      - name: gpus
        resourceClaimTemplateName: all-node-gpus
      - name: imex-channel
        resourceClaimName: shared-imex-channel0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: imex-test1
  name: pod1
  labels:
    app: imex-test1-pod1
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pod1
  template:
    metadata:
      labels:
        app: pod1
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: gpus
          - name: imex-channel
      resourceClaims:
      - name: gpus
        resourceClaimTemplateName: all-node-gpus
      - name: imex-channel
        resourceClaimName: shared-imex-channel1

When I run this, the resources are not allocated to the pods of each deployment as expected; instead, all pods remain Pending forever.

However, if I change the template to explicitly request 1 GPU from a node instead of using allocationMode: All, things work as expected.
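
For reference, a minimal sketch of the working variant of the claim template (the template and request names here are hypothetical; allocationMode: ExactCount with count: 1 is what "explicitly requesting 1 GPU" amounts to in the v1alpha3 API):

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: imex-test1
  name: one-node-gpu          # hypothetical name for the workaround template
spec:
  spec:
    devices:
      requests:
      - name: single-gpu
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount  # instead of 'All'
        count: 1
```

The Deployments' resourceClaimTemplateName would then point at this template instead of all-node-gpus.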

One thing to note is that the code linked below doesn't consider the CEL expression selector in the gpu.nvidia.com and imex.nvidia.com device classes when calculating requestData.numDevices. That might be a factor here: https://github.com/kubernetes/kubernetes/blob/52095a8b7b9b75d67a3882a21a6647e4f90ade48/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator.go#L176-L210

What did you expect to happen?

My expectation was that we would see 4 pods from each deployment with the following set of resources:

deployment 0, pod0:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 0
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 0

deployment 0, pod1:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 0
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 0

deployment 0, pod2:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 0
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 0

deployment 0, pod3:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 0
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 0

deployment 1, pod0:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 1
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 1

deployment 1, pod1:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 1
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 1

deployment 1, pod2:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 1
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 1

deployment 1, pod3:
- 1 GPU from a node in the same IMEX domain as all other pods in deployment 1
- IMEX channel 0 from the same IMEX domain as all other pods in deployment 1

How can we reproduce it (as minimally and precisely as possible)?

Apply the specs listed above in a cluster with the following DeviceClasses and DRA resources available...

Here are my (relevant) device classes:

---
- apiVersion: resource.k8s.io/v1alpha3
  kind: DeviceClass
  metadata:
    name: gpu.nvidia.com
  spec:
    selectors:
    - cel:
        expression: device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type
          == 'gpu'
---
- apiVersion: resource.k8s.io/v1alpha3
  kind: DeviceClass
  metadata:
    name: imex.nvidia.com
  spec:
    selectors:
    - cel:
        expression: device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type
          == 'imex-channel'

Here is the ResourceSlice for one of my GPU nodes (the others look similar except for the node name):

- apiVersion: resource.k8s.io/v1alpha3
  kind: ResourceSlice
  metadata:
    name: k8s-dra-driver-cluster-worker-gpu.nvidia.com-dpwj8
  spec:
    devices:
    - basic:
        attributes:
          architecture:
            string: Ampere
          brand:
            string: Nvidia
          cudaComputeCapability:
            version: 8.0.0
          cudaDriverVersion:
            version: 12.6.0
          driverVersion:
            version: 560.35.3
          index:
            int: 0
          minor:
            int: 7
          productName:
            string: NVIDIA A100-SXM4-40GB
          type:
            string: gpu
          uuid:
            string: GPU-b1028956-cfa2-0990-bf4a-5da9abb51763
        capacity:
          memory: 40Gi
      name: gpu-0
    driver: gpu.nvidia.com
    nodeName: k8s-dra-driver-cluster-worker
    pool:
      generation: 0
      name: k8s-dra-driver-cluster-worker
      resourceSliceCount: 1

Here are the definitions of my two IMEX channel ResourceSlices:

---
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1alpha3
  kind: ResourceSlice
  metadata:
    name: imex-domain-0f884867-ba2f-4294-9155-b495ff367eea-1
  spec:
    devices:
    - basic:
        attributes:
          channel:
            int: 0
          type:
            string: imex-channel
      name: imex-channel-0
    - basic:
        attributes:
          channel:
            int: 1
          type:
            string: imex-channel
      name: imex-channel-1
    ...
    driver: gpu.nvidia.com
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.clusteruuid
          operator: In
          values:
          - 0f884867-ba2f-4294-9155-b495ff367eea
        - key: nvidia.com/gpu.cliqueid
          operator: In
          values:
          - "1"
    pool:
      generation: 0
      name: imex-domain-0f884867-ba2f-4294-9155-b495ff367eea-1
      resourceSliceCount: 1
---
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1alpha3
  kind: ResourceSlice
  metadata:
    name: imex-domain-0f884867-ba2f-4294-9155-b495ff367eea-2
  spec:
    devices:
    - basic:
        attributes:
          channel:
            int: 0
          type:
            string: imex-channel
      name: imex-channel-0
    - basic:
        attributes:
          channel:
            int: 1
          type:
            string: imex-channel
      name: imex-channel-1
    ...
    driver: gpu.nvidia.com
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.clusteruuid
          operator: In
          values:
          - 0f884867-ba2f-4294-9155-b495ff367eea
        - key: nvidia.com/gpu.cliqueid
          operator: In
          values:
          - "2"
    pool:
      generation: 0
      name: imex-domain-0f884867-ba2f-4294-9155-b495ff367eea-2
      resourceSliceCount: 1
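
With these DeviceClasses and ResourceSlices in place, the failure can be observed with standard kubectl commands, roughly as follows (a sketch; the manifest filename is an assumption):

```console
$ kubectl apply -f imex-test1.yaml
$ kubectl get pods -n imex-test1                    # all pods stay Pending
$ kubectl describe pods -n imex-test1 -l app=pod0   # shows the FailedScheduling event
```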

Anything else we need to know?

No response

Kubernetes version

```console
$ kubectl version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.0
```

Cloud provider

NONE

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

k8s-ci-robot commented 1 day ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

pohly commented 1 day ago

> One thing to note is that the code linked below doesn't consider the CEL expression selector in the gpu.nvidia.com and imex.nvidia.com device classes when calculating requestData.numDevices.

In the resourceapi.DeviceAllocationModeExactCount case the numDevices is exactly what is requested. In the resourceapi.DeviceAllocationModeAll case, isSelectable checks the device class selector.

klueska commented 1 day ago

Hmm, OK, I should have said (doesn't appear to consider...). Thanks for the clarification.

pohly commented 1 day ago

Can you simplify? Does it still happen with just two nodes and two pods?

klueska commented 1 day ago

It even happens with the following:

---
apiVersion: v1
kind: Namespace
metadata:
  name: imex-test1
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  namespace: imex-test1
  name: shared-imex-channel0
spec:
  devices:
    requests:
    - name: channel
      deviceClassName: imex.nvidia.com
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: imex-test1
  name: all-node-gpus
spec:
  spec:
    devices:
      requests:
      - name: all-gpus
        deviceClassName: gpu.nvidia.com
        allocationMode: All
---
apiVersion: v1
kind: Pod
metadata:
  namespace: imex-test1
  name: pod0
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpus
      - name: imex-channel
  resourceClaims:
  - name: gpus
    resourceClaimTemplateName: all-node-gpus
  - name: imex-channel
    resourceClaimName: shared-imex-channel0

The error is:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  7s (x12 over 93s)  default-scheduler  running "DynamicResources" filter plugin: stop allocation

And (as before) removing allocationMode: All allows it to succeed in scheduling (with the correct set of expected resources).

klueska commented 1 day ago

One thing to note is that I am using the same driver to advertise both node-local resources and network-attached resources. Maybe something in the allocation logic never considered this case?

pohly commented 1 day ago

The allocator doesn't care about who publishes the resource slices.

With just one pod, it's simple enough to fit into https://github.com/kubernetes/kubernetes/blob/df5787a57fe391b75767028ea2a0255c82185680/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator_test.go#L358-L367

I'll try...

/assign

kannon92 commented 1 day ago

/sig node
/label wg-devicemanagement

k8s-ci-robot commented 1 day ago

@kannon92: The label(s) /label wg-devicemanagement cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, official-cve-feed. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to [this](https://github.com/kubernetes/kubernetes/issues/127554#issuecomment-2368065091):

> /sig node
> /label wg-devicemanagement

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

kannon92 commented 1 day ago

/wg device-management

k8s-ci-robot commented 1 day ago

@kannon92: The label(s) wg/device-management cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes/kubernetes/issues/127554#issuecomment-2368069921):

> /wg device-management

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

pohly commented 1 day ago

We seem to lack some configuration for our working group... the label is missing.

> In the resourceapi.DeviceAllocationModeAll case, isSelectable checks the device class selector.

@klueska was on the right track. It should have checked, but didn't. The code was there, but the class was not set yet.

running "DynamicResources" filter plugin: stop allocation

That "stop allocation" error also is wrong. It's used internally, but shouldn't get passed on because it's meaningless for users. Instead they should be told that allocation of all claims wasn't possible. I fixed that and also added several more test cases.

pohly commented 1 day ago

=> https://github.com/kubernetes/kubernetes/pull/127565