volcano-sh / devices

Device plugins for Volcano, e.g. GPU
Apache License 2.0

gpu number cannot be used #31

Open Trainbow opened 1 year ago

Trainbow commented 1 year ago

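Hello, I am trying out scheduling with volcano gpu number. After installing by following the steps in the volcano tutorial, every node with GPUs correctly shows how many GPUs it has. However, when I create a pod, the container does not have the volcano-gpu-number environment variable, and running nvidia-smi inside it shows all of the node's GPUs. Do I need to change the yaml file?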

Thor-wl commented 1 year ago

Hey, which version are you using?

Trainbow commented 1 year ago

volcano-1.6.0

Thor-wl commented 1 year ago

/cc @wangyang0616 Can you help take a look?

wangyang0616 commented 1 year ago

ok, let me take a look

wangyang0616 commented 1 year ago

@Trainbow Could you post the yaml file used to create the test task? Also, can the pod be scheduled successfully with the default k8s scheduler?

Trainbow commented 1 year ago

I used the sample yaml from the volcano gpu-number readme.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  namespace: model
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-number: 1 # requesting 1 GPU card
          # nvidia.com/gpu: 1

I also installed nvidia's k8s-device-plugin for testing. When the limits field uses nvidia.com/gpu, the pod's container works well and has exactly one gpu device. When I use volcano.sh/gpu-number, the container's env does not have the variable VOLCANO_GPU_ALLOCATED, and NVIDIA_VISIBLE_DEVICES is set to all. I also tried gpu-sharing with volcano by following the official tutorial, and in that case I can find the corresponding environment variables in the pod.

wangyang0616 commented 1 year ago

The Volcano device plugin's GPU strategy defaults to share mode, which means you can use volcano.sh/gpu-memory. If you want to use volcano.sh/gpu-number, you need to set the strategy to number; see for details: config-the-volcano-device-plugin-binary
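
A rough sketch of what that setting might look like in the device plugin DaemonSet is below. The flag name --gpu-strategy, the container name, and the image tag are assumptions for illustration and are not confirmed in this thread; check them against the README section referenced above before applying anything.

```yaml
# Sketch only: fragment of the device plugin DaemonSet pod template.
# ASSUMPTIONS: the flag name --gpu-strategy, the container name, and the
# image tag are illustrative; verify them against the devices README.
spec:
  template:
    spec:
      containers:
        - name: volcano-device-plugin
          image: volcanosh/volcano-device-plugin:latest
          # Switch from the default share mode to number mode so that
          # volcano.sh/gpu-number is the resource the plugin handles.
          args: ["--gpu-strategy=number"]
```

After the plugin pods restart with the new strategy, a pod requesting volcano.sh/gpu-number should be the one to retest with.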

Hope the above information is helpful to you.