Open hzy46 opened 3 years ago
Detailed Work Items for this issue:
computing_devices folder. Use name like nvidia.com_gpu
defaultComputingDeviceType from layout.yaml (Use nvidia.com/gpu if not found) PR #5165
hivedComputingDeviceList in https://github.com/microsoft/pai/blob/d0aef5dc009794b4804027b4f21c78556024d2ec/src/rest-server/src/models/v2/job/k8s.js
job_exporter, node_exporter, watchdog codes to support different hardware
If all P0 items are done, we can support different hardware in the default scheduler. If all P1 items are done, we can support different hardware in the hived scheduler. P2 items are nice-to-have.
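The fallback rule and the folder naming above can be sketched as follows. This is a minimal illustration, not the code from PR #5165: it assumes layout.yaml has already been parsed into a dict, that the default type is taken from the first worker machine whose sku declares a computing-device, and that the folder name is obtained by replacing "/" with "_". The function names are hypothetical.

```python
# Fallback used when layout.yaml declares no computing device (from the issue text).
FALLBACK_DEVICE_TYPE = "nvidia.com/gpu"


def default_computing_device_type(layout: dict) -> str:
    """Return the device type of the first worker sku that declares one.

    Assumption: `layout` is the parsed content of layout.yaml, with the
    `machine-sku` and `machine-list` keys shown in the test case of this issue.
    """
    skus = layout.get("machine-sku") or {}
    for machine in layout.get("machine-list") or []:
        # Only worker machines run jobs, so masters are skipped.
        if str(machine.get("pai-worker", "")).lower() != "true":
            continue
        sku = skus.get(machine.get("machine-type")) or {}
        device = sku.get("computing-device") or {}
        if device.get("type"):
            return device["type"]
    return FALLBACK_DEVICE_TYPE


def device_folder_name(device_type: str) -> str:
    """Map a device type such as nvidia.com/gpu to a folder name such as nvidia.com_gpu.

    Assumption: the only transformation needed is replacing "/" with "_".
    """
    return device_type.replace("/", "_")
```

Picking the first matching worker is only one possible policy; it is sufficient under the single-device-type assumption made later in this issue.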
Test cases for rest-server:
./paictl.py service stop -n hivedscheduler cluster-configuration rest-server
services-configuration.yaml: disable hivedscheduler
layout.yaml: set the cluster workers' computing device type to a.b.com/c
e.g.:
machine-sku:
  master-machine: # define a machine sku
    # the resource requirements for all the machines of this sku
    # We use the same memory format as Kubernetes, e.g. Gi, Mi
    # Reference: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory
    mem: 60Gi
    cpu:
      # the number of CPU vcores
      vcore: 24
  gpu-machine:
    computing-device:
      type: a.b.com/c
      model: faked
      count: 4
    mem: 220Gi
    cpu:
      vcore: 24
machine-list:
  - hostname: pai-master # name of the machine, **do not** use upper case alphabet letters for hostname
    hostip: 10.0.0.1
    machine-type: master-machine # only one master-machine supported
    pai-master: "true"
  - hostname: pai-worker1
    hostip: 10.0.0.2
    machine-type: gpu-machine
    pai-worker: "true"
  - hostname: pai-worker2
    hostip: 10.0.0.3
    machine-type: gpu-machine
    pai-worker: "true"
………………
./paictl.py service start -n hivedscheduler cluster-configuration rest-server
a.b.com/c resource request in the pod spec.
./paictl.py service stop -n hivedscheduler cluster-configuration rest-server
services-configuration.yaml: enable hivedscheduler; set rest-server.hived-computing-device-envs to TEST,NVIDIA_VISIBLE_DEVICES,HIVED_VISIBLE_DEVICES
layout.yaml: set the cluster workers' computing device type back to nvidia.com/gpu
./paictl.py service start -n hivedscheduler cluster-configuration rest-server
TEST is set to something like 0,1,.....
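The second test case relies on every name listed in rest-server.hived-computing-device-envs receiving the same device index list. A rough sketch of that expansion is shown below. It assumes the setting is a comma-separated string and that the assigned device indices are available as a list of integers; how rest-server actually obtains the indices is not covered here, and both function names are hypothetical.

```python
def parse_env_names(setting: str) -> list:
    """Split a comma-separated setting such as "TEST,NVIDIA_VISIBLE_DEVICES" into names."""
    return [name.strip() for name in setting.split(",") if name.strip()]


def device_env_vars(env_names: list, device_indices: list) -> list:
    """Build container env entries that all carry the same device index list.

    Assumption: every configured variable receives the identical value,
    formatted as comma-separated indices (for example "0,1").
    """
    value = ",".join(str(index) for index in device_indices)
    return [{"name": name, "value": value} for name in env_names]


if __name__ == "__main__":
    names = parse_env_names("TEST,NVIDIA_VISIBLE_DEVICES,HIVED_VISIBLE_DEVICES")
    for entry in device_env_vars(names, [0, 1]):
        print(entry["name"], "=", entry["value"])
```

With this shape, adding a vendor-specific variable only requires a configuration change, which is what the TEST entry in the test case is meant to verify.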
Motivation
Currently, OpenPAI supports the most widely used computing devices: Nvidia GPU, AMD GPU and CPU. It also has the potential to support other types of device, e.g. AI computing chips (NPU).
Goal
Decouple OpenPAI services from specific hardware types. One OpenPAI service container can support a list of hardware types.
Requirements
For every type of computing device, the vendor should guarantee:
MVP with default scheduler
Assuming that there is only one type of computing device in a cluster, we could build a minimum viable solution with the default scheduler by
ComputeDevice (default is nvidia.com/gpu) in deployment and record it in configmap
ComputeDevice in quick start
nvidia.com/gpu to ComputeDevice in rest server https://github.com/microsoft/pai/blob/2fb370a59387f7df5e6cec9d30d194f3af19e2d9/src/rest-server/src/models/v2/job/k8s.js#L483-L487
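To illustrate the rest-server change, here is a minimal sketch of building a container's resource limits with a configurable device type instead of a hard-coded nvidia.com/gpu key. It is written in Python for brevity and does not reproduce the referenced k8s.js code; the function name and parameters are hypothetical. It relies only on the Kubernetes convention that extended resources are requested by name under limits with an integer quantity.

```python
def build_resource_limits(cpu: int, memory_mb: int, device_count: int,
                          compute_device: str = "nvidia.com/gpu") -> dict:
    """Return the `resources.limits` mapping for one container.

    `compute_device` plays the role of the ComputeDevice setting described
    above; it defaults to nvidia.com/gpu when nothing else is configured.
    """
    limits = {"cpu": cpu, "memory": f"{memory_mb}Mi"}
    if device_count > 0:
        # The device is requested as a Kubernetes extended resource.
        limits[compute_device] = device_count
    return limits


if __name__ == "__main__":
    print(build_resource_limits(4, 8192, 2, compute_device="a.b.com/c"))
```

Omitting the device key when the count is zero keeps CPU-only jobs free of any device request.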
Besides the necessary work, we (pai-dev team and device vendor) could provide better support by
devices subfolders. The basic idea is to quickly locate device-related code and isolate code for different devices (e.g. different device vendors should avoid editing the same file).
If a component must support diverse types of computing device, there will be a devices folder in it. PAI services should take these files into consideration at build time, and one container will support a list of different machine models. Other components, like the deploy script, should check these files at runtime.
nvidia-smi and prometheus exporter
Perfect support with HiveD
By enabling HiveD, we could get better support
Some extra effort is needed to achieve this
layout.yaml #5151
NVIDIA_VISIBLE_DEVICES and PAI_AMD_VISIBLE_DEVICES. https://github.com/microsoft/pai/blob/2fb370a59387f7df5e6cec9d30d194f3af19e2d9/src/rest-server/src/models/v2/job/k8s.js#L656-L676
Some optional work items include
layout.yaml and HiveD skus
sku-(cpu,gpu,mem) converting simply, predictably and decoupled with devices #5148.