kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0

Support claiming multiple device types in resource requests and limits #1121

Closed lizhiboo closed 4 days ago

lizhiboo commented 2 weeks ago

Motivation: Arena requests NVIDIA GPUs by default and does not yet support devices from other chip vendors such as AMD, Ascend, or Hygon.

Design: add a --device parameter that sets the device resource in the Pod's resources, under both requests and limits, as below:

      resources:
        limits:
          cpu: "10"
          memory: 32Gi
          hygon.com/dcu: 1
        requests:
          cpu: "10"
          memory: 32Gi
          hygon.com/dcu: 1
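
To make the intended mapping concrete, here is a minimal Python sketch of how a --device value of the form <resource>=<count> could be turned into the resources block above. It is not Arena's implementation; the helper names parse_device and build_resources are hypothetical and exist only for illustration. The device count is mirrored into both requests and limits because Kubernetes requires the two to be equal for extended resources when both are set.

```python
def parse_device(spec):
    """Split '<resource>=<count>' (e.g. 'hygon.com/dcu=1') into (name, count).

    Hypothetical helper: the flag format is taken from the usage examples
    in this issue, not from Arena's source code.
    """
    # Split on the last '=' so the resource name is kept whole.
    name, sep, count = spec.rpartition("=")
    if not sep or not name or not (count.isascii() and count.isdigit()):
        raise ValueError(f"expected <resource>=<count>, got {spec!r}")
    return name, int(count)


def build_resources(device_specs, cpu, memory):
    """Build a Pod 'resources' mapping with each device in requests and limits."""
    base = {"cpu": cpu, "memory": memory}
    for spec in device_specs:
        name, count = parse_device(spec)
        if name in base:
            raise ValueError(f"resource {name!r} specified more than once")
        base[name] = count
    # Extended resources cannot be overcommitted, so requests mirror limits.
    return {"limits": dict(base), "requests": dict(base)}


if __name__ == "__main__":
    print(build_resources(["hygon.com/dcu=1"], cpu="10", memory="32Gi"))
```

Passing several specs, for example ["hygon.com/dcu=1", "huawei.com/Ascend910=1"], would add one entry per device type under both keys, which is the "multiple device types" case named in the title.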

Usage:

arena submit tfjob \
    --name=tfjobtest \
    --working-dir=/root \
    --ps-gpus=1 \
    --ps=1 \
    --workers=1 \
    --device=hygon.com/dcu=1 \
    --data-dir=/usr/local/hg-lib:/usr/local/hg-lib \
    --image=xxx:ascend_tensorflow_test \
    'sh -c train.sh'

arena serve custom \
    --name=cstest \
    --replicas=1 \
    --port=80 \
    --device=huawei.com/Ascend910=1 \
    --data-dir=/usr/local/ascend910-driver:/usr/local/ascend910-driver \
    --image=xxx:ascend-test \
    --command="sh train.sh"