Closed by MondayCha 4 months ago
Resource reclaim only happens when the `reclaim` action is enabled in the ConfigMap.
Could you paste your Job YAML and the Volcano scheduler logs after raising the log level to 4? Thanks.
How are the nvidia.com/t4 and nvidia.com/v100 resource types configured? Via a device plugin?
@Murphylu1993 I documented the configuration process in "Configuring Multi-Dimensional Elastic Capacity Scheduling in Volcano v1.9.0" at https://zhuanlan.zhihu.com/p/705340911, but that site currently requires login to view. I am writing a more detailed English version and will try to post it to the community.
Hi @MondayCha, I would be very interested in your detailed English version documenting how you configured the nvidia-device-plugin to report the different types of GPUs as extended resources.
Would you be able to share, even if it's not fully revised/polished?
Hello @lajd, after communicating with the Volcano community, it seems the official SOP might be available by the end of August.
In the meantime, I can share my own configuration (translated into English via ChatGPT).
Volcano v1.9.0 introduces capacity scheduling. However, the default NVIDIA Device Plugin reports all GPUs as nvidia.com/gpu and does not support reporting different GPU models as separate resources, as the example requires. To address this, three configuration steps are needed:
Initially, we used the NVIDIA GPU Operator to manage GPU resources uniformly, with GFD and related functions already configured. Since the NVIDIA drivers are already installed on our hosts and we need a customized Device Plugin, we configure the GPU Operator to keep DCGM Exporter enabled and to disable driver and Device Plugin management.
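The GPU Operator setup above can be sketched as Helm values. The keys below (`driver.enabled`, `devicePlugin.enabled`, `dcgmExporter.enabled`) are assumptions based on common gpu-operator chart options, so verify them against your chart version:

```yaml
# Hypothetical values.yaml sketch for the NVIDIA GPU Operator:
# keep DCGM Exporter, but manage the driver and device plugin ourselves.
driver:
  enabled: false     # drivers are already installed on the hosts
devicePlugin:
  enabled: false     # we deploy a customized device plugin separately
dcgmExporter:
  enabled: true      # keep Pod-level GPU metrics
```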
Volcano provides queue-based resource capabilities, but to report different types of GPUs, the Device Plugin needs to be adapted.
When installing the Device Plugin via Helm, specify the configuration file:
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.15.0 \
--namespace nvidia-device-plugin \
--create-namespace \
--set config.default=other-config \
--set-file config.map.other-config=other-config.yaml \
--set-file config.map.p100-config=p100-config.yaml \
--set-file config.map.v100-config=v100-config.yaml
Configuration file content:
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
resources:
  gpus:
    - pattern: "Tesla V100-SXM2-32GB"
      name: v100
    - pattern: "Tesla P100-PCIE-*"
      name: p100
    - pattern: "NVIDIA GeForce RTX 2080 Ti"
      name: 2080ti
    - pattern: "NVIDIA TITAN Xp"
      name: titan
    - pattern: "Tesla T4"
      name: t4
Next, modify the NVIDIA Device Plugin source code. Additionally, because of the Go version on my machine, I had to modify the Dockerfile and rebuild the image. After modifying and rebuilding, replace the DaemonSet image with the new version so that different GPU models are reported as different resources.
Although the new resources are now reported, the previous nvidia.com/gpu entry does not disappear on its own:
kubectl get nodes -ojson | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable}'
Sample output:
{
  "name": "huawei-82",
  "allocatable": {
    "cpu": "80",
    "ephemeral-storage": "846624789946",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "263491632Ki",
    "nvidia.com/gpu": "0",
    "nvidia.com/t4": "2",
    "pods": "110"
  }
}
Start kubectl proxy:
kubectl proxy
# Starting to serve on 127.0.0.1:8001
Deletion script (note that / in the resource name must be escaped as ~1 in the JSON Pointer path):
#!/bin/bash
# Check if a node name is provided
if [ -z "$1" ]; then
  echo "Usage: $0 <node-name>"
  exit 1
fi
NODE_NAME=$1

# Prepare the JSON patch data
PATCH_DATA=$(cat <<EOF
[
  {"op": "remove", "path": "/status/capacity/nvidia.com~1gpu"}
]
EOF
)

# Execute the PATCH request
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data "$PATCH_DATA" \
  http://127.0.0.1:8001/api/v1/nodes/$NODE_NAME/status

echo "Patch request sent for node $NODE_NAME"
Save the script as patch_node_gpu.sh, then run it with the node name to clean up:
vim patch_node_gpu.sh
./patch_node_gpu.sh huawei-82
This completes the first stage: re-reporting GPU resources.
After changing the GPU resource name, we found that DCGM Exporter could no longer obtain Pod-level GPU usage metrics. The reason is that DCGM Exporter only matches the exact resource name nvidia.com/gpu or resources with the prefix nvidia.com/mig-.
To address this, modify the DCGM Exporter logic, repackage the image, and replace it.
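The matching rule described above can be illustrated with a small shell sketch. This is not the actual DCGM Exporter Go code, only an illustration of the rule it applies, so a renamed resource like nvidia.com/t4 is skipped:

```shell
# Illustration (not DCGM Exporter's real code) of the resource-name
# matching rule: exact match on nvidia.com/gpu, or prefix nvidia.com/mig-.
matches_dcgm() {
  case "$1" in
    nvidia.com/gpu) return 0 ;;    # exact match
    nvidia.com/mig-*) return 0 ;;  # MIG prefix match
    *) return 1 ;;                 # everything else is ignored
  esac
}

for r in nvidia.com/gpu nvidia.com/mig-1g.5gb nvidia.com/t4; do
  if matches_dcgm "$r"; then echo "$r: matched"; else echo "$r: skipped"; fi
done
# nvidia.com/gpu: matched
# nvidia.com/mig-1g.5gb: matched
# nvidia.com/t4: skipped
```

This is why renaming the resource silently breaks Pod-level metrics until the exporter's matching logic is patched.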
Volcano provides a guide titled "How to use capacity plugin", but it is not entirely complete: when configuring the scheduler ConfigMap, you also need to add the reclaim action to enable elasticity.
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim" # add reclaim
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity # add this field and remove proportion plugin.
      - name: nodeorder
      - name: binpack
Additionally, when a Pod requests resources in multiple dimensions (such as CPU, memory, and GPU), make sure no dimension exceeds the queue's deserved value; otherwise the Pod may be preempted.
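A queue with multi-dimensional deserved values might look like the sketch below. The field names follow the capacity plugin guide; the queue name and amounts are illustrative only:

```yaml
# Illustrative Queue for the capacity plugin: each dimension has
# its own deserved amount, and all of a Pod's requests should fit
# under these values to avoid preemption.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved:
    cpu: "8"
    memory: 16Gi
    nvidia.com/t4: "1"
```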
Thank you @MondayCha, much appreciated!
Please provide an in-depth description of the question you have:
I am trying to configure the Capacity Plugin to reclaim resources that exceed the "deserved" amount for other queues. Despite my efforts, I haven't been able to achieve the desired behavior.
I followed the configuration guide at https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_capacity_plugin.md#environment-setup to set up the scheduler ConfigMap. My cluster has ample CPU and Memory resources, but only 4 nvidia.com/t4 GPUs.
Initially, I set the "deserved" value for nvidia.com/t4 to 1 in queue 1, and then submitted 3 Jobs requesting 3 nvidia.com/t4 GPUs.
After that, I set the "deserved" value for nvidia.com/t4 to 2 in queue 2 and submitted 2 Jobs requesting 2 nvidia.com/t4 GPUs, but the resources were not reclaimed as expected.
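For reference, the Jobs in this reproduction look roughly like the sketch below. It is not the exact manifest used; the names, image, and command are placeholders:

```yaml
# Illustrative Volcano Job requesting nvidia.com/t4 GPUs from queue2.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: t4-job-q2
spec:
  schedulerName: volcano
  queue: queue2
  minAvailable: 1
  tasks:
  - name: worker
    replicas: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/t4: "2"
```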
What do you think about this question?:
Additionally, I attempted to add
actions: "enqueue, allocate, backfill, reclaim, preempt"
to the ConfigMap, which resulted in frequent preemptions but still did not achieve the desired behavior for the capacity plugin. I suspect that some configuration might be missing from the documentation. For example, I noticed the new EnablePreemptive setting introduced in MR #3283, but I am unsure how it should be used.
Could you please provide guidance on the necessary configuration?
Environment:
- Kubernetes version (kubectl version): v1.26.9
- Kernel (uname -a): 5.15.0-106-generic