klueska opened this issue 1 day ago
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
One thing to note is that the code linked below doesn't consider the CEL expression selector in the gpu.nvidia.com and imex.nvidia.com device classes when calculating requestData.numDevices.
In the resourceapi.DeviceAllocationModeExactCount case, numDevices is exactly what is requested. In the resourceapi.DeviceAllocationModeAll case, isSelectable checks the device class selector.
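For context, here is a minimal self-contained sketch of what that calculation amounts to (hypothetical stand-in types, not the actual allocator code, which evaluates CEL selectors from the device class):

package main

import "fmt"

// Hypothetical stand-ins for the real resource.k8s.io types.
type AllocationMode string

const (
	ModeExactCount AllocationMode = "ExactCount"
	ModeAll        AllocationMode = "All"
)

type Request struct {
	Mode  AllocationMode
	Count int
}

// numDevices mirrors the shape of the logic in allocator.go: for ExactCount
// the size comes straight from the request; for All it is the number of
// published devices that pass the selector check.
func numDevices(req Request, devices []string, isSelectable func(string) bool) int {
	switch req.Mode {
	case ModeExactCount:
		return req.Count
	case ModeAll:
		n := 0
		for _, d := range devices {
			if isSelectable(d) { // this is where the class selectors must apply
				n++
			}
		}
		return n
	}
	return 0
}

func main() {
	devices := []string{"gpu-0", "gpu-1", "channel-0"}
	onlyGPUs := func(d string) bool { return d != "channel-0" } // stand-in for the CEL filter
	fmt.Println(numDevices(Request{Mode: ModeAll}, devices, onlyGPUs)) // 2
}

If isSelectable never sees the class selectors, mode All counts devices from every driver, which is exactly the mismatch being discussed.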
Hmm, OK, I should have said (doesn't appear to consider...). Thanks for the clarification.
Can you simplify? Does it still happen with just two nodes and two pods?
It even happens with the following:
---
apiVersion: v1
kind: Namespace
metadata:
  name: imex-test1
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  namespace: imex-test1
  name: shared-imex-channel0
spec:
  devices:
    requests:
    - name: channel
      deviceClassName: imex.nvidia.com
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: imex-test1
  name: all-node-gpus
spec:
  spec:
    devices:
      requests:
      - name: all-gpus
        deviceClassName: gpu.nvidia.com
        allocationMode: All
---
apiVersion: v1
kind: Pod
metadata:
  namespace: imex-test1
  name: pod0
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpus
      - name: imex-channel
  resourceClaims:
  - name: gpus
    resourceClaimTemplateName: all-node-gpus
  - name: imex-channel
    resourceClaimName: shared-imex-channel0
The error is:
Events:
  Type     Reason            Age                From               Message
  ----     ------            ---                ----               -------
  Warning  FailedScheduling  7s (x12 over 93s)  default-scheduler  running "DynamicResources" filter plugin: stop allocation
And (as before) removing allocationMode: All allows it to succeed in scheduling (with the correct set of expected resources).
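For comparison, the variant that schedules fine differs only in the request: leaving out allocationMode makes it default to ExactCount with count 1, i.e. the equivalent of this sketch (names here are illustrative):

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: imex-test1
  name: one-node-gpu
spec:
  spec:
    devices:
      requests:
      - name: one-gpu
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount
        count: 1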
One thing to note is that I am using the same driver to advertise both node-local resources and network-attached resources. Maybe something in the allocation logic never considered this case?
The allocator doesn't care about who publishes the resource slices.
With just one pod, it's simple enough to fit into https://github.com/kubernetes/kubernetes/blob/df5787a57fe391b75767028ea2a0255c82185680/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator_test.go#L358-L367
I'll try...
/assign
/sig node
/label wg-devicemanagement
@kannon92: The label(s) /label wg-devicemanagement cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, official-cve-feed. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?
/wg device-management
@kannon92: The label(s) wg/device-management cannot be applied, because the repository doesn't have them.
We seem to lack some configuration for our working group... the label is missing.
In the resourceapi.DeviceAllocationModeAll case, isSelectable checks the device class selector.
@klueska was on the right track. It should have checked, but didn't. The code was there, but the class was not set yet.
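To make that ordering concrete, here is a minimal runnable sketch of the failure mode (hypothetical stand-in types, not the actual fix): if the class on requestData is still unset when isSelectable runs, there is no selector to evaluate, and devices from every driver appear to match.

package main

import "fmt"

// Hypothetical stand-ins: the class contributes the CEL selectors that
// isSelectable is supposed to evaluate.
type class struct{ driver string }

type requestData struct {
	class *class // nil until resolved; the bug was counting while still nil
}

func isSelectable(r *requestData, deviceDriver string) bool {
	if r.class == nil {
		// No class means no selector to evaluate, so everything "matches".
		return true
	}
	return r.class.driver == deviceDriver
}

func main() {
	r := &requestData{} // class not set yet: the buggy ordering
	fmt.Println(isSelectable(r, "imex.nvidia.com")) // true (wrong)

	r.class = &class{driver: "gpu.nvidia.com"} // resolve the class first
	fmt.Println(isSelectable(r, "imex.nvidia.com")) // false (correct)
}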
running "DynamicResources" filter plugin: stop allocation
That "stop allocation" error is also wrong. It's used internally, but shouldn't get passed on because it's meaningless for users. Instead they should be told that allocation of all claims wasn't possible. I fixed that and also added several more test cases.
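A common pattern for that kind of cleanup is to keep the sentinel internal and translate it at the boundary; a small sketch with illustrative names (not the actual scheduler plugin code):

package main

import (
	"errors"
	"fmt"
)

// A sentinel like this is useful for unwinding an internal search, but it
// should never be surfaced verbatim to users.
var errStopAllocation = errors.New("stop allocation")

func allocateAll() error {
	// ... backtracking search returns the sentinel when it gives up ...
	return errStopAllocation
}

func filter() error {
	if err := allocateAll(); err != nil {
		if errors.Is(err, errStopAllocation) {
			// Translate into a message that means something to the pod's owner.
			return errors.New("cannot allocate all claims")
		}
		return fmt.Errorf("allocating claims: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(filter()) // cannot allocate all claims
}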
What happened?
I was testing the ability to allocate both node-local resources (GPUs) along with a new network attached resource called an IMEX channel.
My setup is as follows:
With this setup, I create two ResourceClaims, one ResourceClaimTemplate, and two Deployments. The two ResourceClaims are for two distinct IMEX channels, the ResourceClaimTemplate is for all GPUs on a given node, and the two Deployments (of 4 replicas each) consume these claims/templates across each replica in order to simulate running 2 MPI jobs across two different IMEX domains.
Here are the specs:
When I run this, I do not get the resources allocated to each pod in each deployment as expected (instead, all pods remain pending forever).
However, if I change to explicitly requesting 1 GPU from a node instead of using allocationMode: All, things work as expected.
One thing to note is that the code linked below doesn't consider the CEL expression selector in the gpu.nvidia.com and imex.nvidia.com device classes when calculating requestData.numDevices. That might be a factor in this somehow: https://github.com/kubernetes/kubernetes/blob/52095a8b7b9b75d67a3882a21a6647e4f90ade48/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator.go#L176-L210
What did you expect to happen?
My expectation was that we would see 4 pods from each deployment with the following set of resources:
deployment 0, pod0: 1 GPU from a node in the same IMEX domain as all other pods in deployment 0; IMEX channel 0 from the same IMEX domain as all other pods in deployment 0
deployment 0, pod1: 1 GPU from a node in the same IMEX domain as all other pods in deployment 0; IMEX channel 0 from the same IMEX domain as all other pods in deployment 0
deployment 0, pod2: 1 GPU from a node in the same IMEX domain as all other pods in deployment 0; IMEX channel 0 from the same IMEX domain as all other pods in deployment 0
deployment 0, pod3: 1 GPU from a node in the same IMEX domain as all other pods in deployment 0; IMEX channel 0 from the same IMEX domain as all other pods in deployment 0
deployment 1, pod0: 1 GPU from a node in the same IMEX domain as all other pods in deployment 1; IMEX channel 0 from the same IMEX domain as all other pods in deployment 1
deployment 1, pod1: 1 GPU from a node in the same IMEX domain as all other pods in deployment 1; IMEX channel 0 from the same IMEX domain as all other pods in deployment 1
deployment 1, pod2: 1 GPU from a node in the same IMEX domain as all other pods in deployment 1; IMEX channel 0 from the same IMEX domain as all other pods in deployment 1
deployment 1, pod3: 1 GPU from a node in the same IMEX domain as all other pods in deployment 1; IMEX channel 0 from the same IMEX domain as all other pods in deployment 1
How can we reproduce it (as minimally and precisely as possible)?
Apply the specs listed above in a cluster with the following deviceClasses and DRA resources available...
Here are my (relevant) device classes:
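(For readers unfamiliar with the shape: a DeviceClass that selects one driver's devices via a CEL expression generally looks like the following sketch; this is not necessarily the exact classes used here.)

apiVersion: resource.k8s.io/v1alpha3
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.nvidia.com'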
Here is the definition of one of my GPU nodes (the others look similar except for the node name):
Here is the definition of my two IMEX channel ResourceSlices:
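(Network-attached devices like these are published in a ResourceSlice with a node selector rather than a node name; a generic sketch of that shape, with illustrative names, labels, and attributes:)

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: imex-domain-0-channels
spec:
  driver: imex.nvidia.com
  pool:
    name: imex-domain-0
    generation: 1
    resourceSliceCount: 1
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nvidia.com/imex.domain   # illustrative label
        operator: In
        values: ["0"]
  devices:
  - name: channel-0
    basic:
      attributes:
        domain:
          string: "0"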
Anything else we need to know?
No response
Kubernetes version
Cloud provider
NONE
OS version
No response
Install tools
No response
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response