volcano-sh / devices

Device plugins for Volcano, e.g. GPU
Apache License 2.0

Enable resource naming in config #68

Closed MondayCha closed 1 week ago

MondayCha commented 1 month ago

Motivation

Volcano v1.9.0 introduces capacity scheduling, which makes it possible to give a queue separate quotas for different GPU types (important in production environments). For example:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 2
    memory: 8Gi
    nvidia.com/t4: 40
    nvidia.com/a100: 20

However, the default NVIDIA device plugin reports every GPU as nvidia.com/gpu, so it cannot report different GPU models under separate resource names as the example requires.

To address this, we need to customize the device plugin.
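For illustration, once nodes advertise per-model resource names, a workload could request a specific model and be accounted against the matching quota in queue1. The following is only a minimal sketch: it assumes the customized plugin reports A100 cards as nvidia.com/a100, and the job name, task name, and container image are hypothetical placeholders.

```yaml
# Sketch only: assumes the node advertises nvidia.com/a100 (not the default nvidia.com/gpu).
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: a100-job            # hypothetical name
spec:
  schedulerName: volcano
  queue: queue1             # the queue from the example above
  minAvailable: 1
  tasks:
    - name: worker          # hypothetical task name
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
              command: ["nvidia-smi"]
              resources:
                limits:
                  nvidia.com/a100: 1   # counted against the queue's nvidia.com/a100 quota
```

A pod that still requests nvidia.com/gpu would not match a renamed resource, so workload manifests need to be updated together with the plugin configuration.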

Change Details

The NVIDIA community has already had discussions about this issue:

This PR is modified based on the above discussion.

Further Impact

Renaming GPU resources will prevent the DCGM Exporter from collecting pod-level GPU usage metrics, because the exporter only recognizes the exact resource name nvidia.com/gpu or names with the prefix nvidia.com/mig-.

volcano-sh-bot commented 1 month ago

Welcome @MondayCha!

It looks like this is your first PR to volcano-sh/devices.

Thank you, and welcome to Volcano. :smiley:

Monokaix commented 1 month ago

Hi, please add more description to this PR, and use git commit -s to sign off your commit.

william-wang commented 1 month ago

Thanks for your contribution. I opened issue #69 for this PR.

william-wang commented 1 month ago

@MondayCha Would you like to add a doc explaining how to configure and use it?

hzxuzhonghu commented 1 month ago

/ok-to-test

Monokaix commented 1 week ago

/lgtm

volcano-sh-bot commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/volcano-sh/devices/blob/release-1.1/OWNERS)~~ [william-wang]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment

Monokaix commented 1 week ago

/lgtm