volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

How to configure the Capacity Plugin to reclaim resources in v1.9.0? #3510

Closed · MondayCha closed this 4 months ago

MondayCha commented 4 months ago

Please provide an in-depth description of the question you have:

I am trying to configure the Capacity Plugin to reclaim resources that exceed the "deserved" amount for other queues. Despite my efforts, I haven't been able to achieve the desired behavior.

I followed the configuration guide at https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_capacity_plugin.md#environment-setup to set up the scheduler ConfigMap. My cluster has ample CPU and Memory resources, but only 4 nvidia.com/t4 GPUs.

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: drf
    enablePreemptable: false
  - name: predicates
  - name: capacity
  - name: nodeorder
  - name: binpack

Initially, I set the "deserved" value for nvidia.com/t4 to 1 in queue q1, then submitted 3 Jobs requesting 3 nvidia.com/t4 GPUs in total. The resulting queue status:

{
  "queue": "q1",
  "deserved": {
    "nvidia.com/t4": "1"
  },
  "allocated": {
    "attachable-volumes-csi-rook-ceph.cephfs.csi.ceph.com": "11m",
    "cpu": "105",
    "memory": "296Gi",
    "nvidia.com/t4": "3",
    "nvidia.com/v100": "9",
    "pods": "11"
  }
}

After that, I set the "deserved" value for nvidia.com/t4 to 2 in queue q2 and submitted 2 Jobs requesting 2 nvidia.com/t4 GPUs in total, but the excess resources were not reclaimed from q1 as expected:

{
  "queue": "q2",
  "deserved": {
    "nvidia.com/t4": "2"
  },
  "allocated": {
    "attachable-volumes-csi-rook-ceph.cephfs.csi.ceph.com": "1m",
    "cpu": "1",
    "memory": "2Gi",
    "nvidia.com/t4": "1",
    "pods": "1"
  }
}
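For context, the Jobs submitted to q2 looked roughly like this (a reconstructed sketch, not the actual manifest; the name and image are placeholders):

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: t4-job-1
spec:
  schedulerName: volcano
  queue: q2
  minAvailable: 1
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/t4: 1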

What do you think about this question?:

Additionally, I attempted to add actions: "enqueue, allocate, backfill, reclaim, preempt" to the ConfigMap, which resulted in frequent preemptions but still did not achieve the desired behavior for the Capacity Plugin.

I suspect that some configuration steps might be missing from the documentation. For example, I noticed the new EnablePreemptive setting introduced in PR #3283, but I am unsure how it should be used.

Could you please provide guidance on the necessary configuration?

Environment:

Monokaix commented 4 months ago

Resource reclaim only happens when the reclaim action is enabled in the ConfigMap. Could you paste your job YAML and the Volcano scheduler logs after adjusting the log level to 4?
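(For reference, one way to adjust the log level, assuming the default volcano-scheduler deployment; -v is the standard klog verbosity flag:)

# Edit the scheduler deployment and set the container arg -v=4:
kubectl -n volcano-system edit deploy volcano-scheduler
# Then collect the logs:
kubectl -n volcano-system logs deploy/volcano-scheduler --tail=500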

MondayCha commented 4 months ago

Thanks.

Murphylu1993 commented 3 months ago

How are the nvidia.com/t4 and nvidia.com/v100 resource types set up? Via a device plugin?

MondayCha commented 3 months ago

How are the nvidia.com/t4 and nvidia.com/v100 resource types set up? Via a device plugin?

@Murphylu1993 I documented the configuration process at https://zhuanlan.zhihu.com/p/705340911 ("Volcano v1.9.0 配置多维弹性 Capacity 调度"), but that site currently requires login to view. I am working on a more detailed English version and will try to post it to the community.

lajd commented 2 months ago

Hi @MondayCha, I would be very interested in your detailed English version documenting how you configured the nvidia-device-plugin to report the different types of GPUs as extended resources.

Would you be able to share, even if it's not fully revised/polished?

MondayCha commented 2 months ago

Hello @lajd, after communicating with the Volcano community, it seems the official SOP might be available by the end of August.

In the meantime, I can share my own configuration (translated into English via ChatGPT).


Volcano v1.9.0 introduces Capacity scheduling. However, the default NVIDIA Device Plugin reports every GPU as the single extended resource nvidia.com/gpu and cannot report different GPU models separately, as the examples above require. To address this, three steps are needed:

  1. Install a custom Device Plugin
  2. Configure DCGM Exporter for Pod-level monitoring
  3. Configure Volcano to use the Capacity scheduling plugin

1. Install a Custom Device Plugin

1.1 Configure GPU Operator and GPU Feature Discovery

Initially, we used the NVIDIA GPU Operator to manage GPU resources uniformly, with GPU Feature Discovery (GFD) and related components already configured. Since the NVIDIA drivers are installed on the hosts and we need a customized Device Plugin, configure the GPU Operator to keep DCGM Exporter enabled while disabling driver and Device Plugin management, as sketched below.
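A minimal sketch, assuming the standard gpu-operator Helm chart switches (verify the exact keys against your chart version):

# Keep DCGM Exporter and GFD; manage the driver and Device Plugin ourselves:
helm upgrade -i gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --set driver.enabled=false \
    --set devicePlugin.enabled=false \
    --set dcgmExporter.enabled=true \
    --set gfd.enabled=true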

1.2 Install a Custom Device Plugin

Volcano provides queue-based resource capabilities, but to report different types of GPUs, the Device Plugin needs to be adapted.

When installing the Device Plugin via Helm, specify the configuration file:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.15.0 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=other-config \
    --set-file config.map.other-config=other-config.yaml \
    --set-file config.map.p100-config=p100-config.yaml \
    --set-file config.map.v100-config=v100-config.yaml

Configuration file content (e.g., other-config.yaml):

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
resources:
  gpus:
  - pattern: "Tesla V100-SXM2-32GB"
    name: v100
  - pattern: "Tesla P100-PCIE-*"
    name: p100
  - pattern: "NVIDIA GeForce RTX 2080 Ti"
    name: 2080ti
  - pattern: "NVIDIA TITAN Xp"
    name: titan
  - pattern: "Tesla T4"
    name: t4

Next, modify the NVIDIA Device Plugin source code so that it accepts and reports these renamed resources.

Additionally, because of the Go version on my build machine, I also had to modify the Dockerfile before rebuilding the image. After modifying and rebuilding, replace the DaemonSet image with the new version so that different GPU models are reported as distinct resources.
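A sketch of the rebuild-and-replace step; the registry, tag, DaemonSet, and container names below are placeholders for whatever your Helm release created:

# Rebuild and push the patched image:
docker build -t registry.example.com/k8s-device-plugin:v0.15.0-custom .
docker push registry.example.com/k8s-device-plugin:v0.15.0-custom
# Point the Device Plugin DaemonSet at the new image:
kubectl -n nvidia-device-plugin set image ds/nvdp-nvidia-device-plugin \
    nvidia-device-plugin-ctr=registry.example.com/k8s-device-plugin:v0.15.0-custom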

1.3 Clean Up Outdated Device Plugin Resources

Although the new resources are now reported, the previously reported nvidia.com/gpu entry does not disappear from the node's status:

kubectl get nodes -ojson | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable}'

Sample output:

{
  "name": "huawei-82",
  "allocatable": {
    "cpu": "80",
    "ephemeral-storage": "846624789946",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "263491632Ki",
    "nvidia.com/gpu": "0",
    "nvidia.com/t4": "2",
    "pods": "110"
  }
}

Start kubectl proxy:

kubectl proxy
# Starting to serve on 127.0.0.1:8001

Deletion script (note that / in the resource name must be escaped as ~1, per JSON Pointer syntax):

#!/bin/bash

# Check if a node name is provided
if [ -z "$1" ]; then
  echo "Usage: $0 <node-name>"
  exit 1
fi

NODE_NAME=$1

# Prepare the JSON patch data
PATCH_DATA=$(cat <<EOF
[
  {"op": "remove", "path": "/status/capacity/nvidia.com~1gpu"}
]
EOF
)

# Execute the PATCH request
curl --header "Content-Type: application/json-patch+json" \
     --request PATCH \
     --data "$PATCH_DATA" \
     http://127.0.0.1:8001/api/v1/nodes/$NODE_NAME/status

echo "Patch request sent for node $NODE_NAME"

Save the script as patch_node_gpu.sh, then pass the Node name to clean up:

vim patch_node_gpu.sh            # paste the script above
chmod +x patch_node_gpu.sh
./patch_node_gpu.sh huawei-82
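To confirm the stale entry is gone (same node as above; jq as used earlier):

kubectl get node huawei-82 -o json | jq '.status | {capacity, allocatable}'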

This completes the first stage: re-reporting GPU resources.

2. Configure DCGM Exporter for Pod-Level Monitoring

After changing the GPU resource names, we found that DCGM Exporter could no longer obtain Pod-level GPU usage metrics. The reason is that, when attributing GPUs to Pods, DCGM Exporter only matches the exact resource name nvidia.com/gpu or resources with the nvidia.com/mig- prefix.

To address this, modify DCGM Exporter's resource-matching logic, rebuild the image, and replace it in the DaemonSet.
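Once the patched exporter is running, a quick way to check that Pod attribution works again (the namespace and DaemonSet name assume a gpu-operator deployment and may differ):

kubectl -n gpu-operator port-forward ds/nvidia-dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | grep -E 'DCGM_FI_DEV_GPU_UTIL.*pod="'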

3. Configure Volcano to Use the Capacity Scheduling Plugin

Volcano provides a guide titled "How to use capacity plugin", but the guide is not entirely accurate: when configuring the scheduler ConfigMap, you also need to add the reclaim action to enable elasticity.

kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim" # add reclaim
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity # add this field and remove proportion plugin.
      - name: nodeorder
      - name: binpack

Additionally, when a Pod requests resources in multiple dimensions (such as CPU, memory, and GPU), make sure that no dimension exceeds the queue's deserved value; otherwise the workload becomes a candidate for reclaim or preemption. A sketch of such a Queue follows.
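For example, a minimal sketch of a Queue that sets deserved on every dimension the jobs request (illustrative values; the field layout follows the v1.9 Queue CRD):

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: q1
spec:
  reclaimable: true
  deserved:
    cpu: "64"
    memory: 128Gi
    nvidia.com/t4: "1"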

lajd commented 2 months ago

Thank you @MondayCha, much appreciated!