GaiaGPU: Sharing GPUs in Container Clouds

houminz commented 3 years ago

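Paper: https://ieeexplore.ieee.org/document/8672318

Source: the Gaia team in the Data Platform Department of Tencent TEG

Overview: GaiaGPU shares GPU resources among containers. Its GPUManager component is open source on GitHub.

Tencent's KubeCon 2018 introduction to GaiaStack is here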

houminz commented 3 years ago

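Background: Containers are widely used in cloud computing because they are lightweight and scalable, and GPUs are widely used to accelerate computation in workloads such as deep learning. How to share GPU resources among containers and raise GPU utilization has therefore been studied extensively. GaiaGPU partitions a physical GPU into multiple virtual GPUs, which lets it both isolate and share GPU memory and compute resources.

The GaiaGPU implementation has two main parts: a Kubernetes part and a vCUDA part.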

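The Kubernetes part builds on the Kubernetes Extended Resources, Device Plugin, and Scheduler Extender mechanisms and consists of two projects:
- GPU Manager: implemented as a Device Plugin. Unlike NVIDIA's k8s-device-plugin, it does not require nvidia-docker2 to be configured and uses the native runc.
- GPU Admission: implemented as a Scheduler Extender. Note that this extender is not mentioned in the paper. The GPU Scheduler in the figure from the paper performs topology-aware GPU selection and is now part of the GPU Manager project.

The vCUDA part is implemented by vcuda-controller, a wrapper around NVIDIA's CUDA library.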

houminz commented 3 years ago

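The components are analyzed in turn below. First is GPU Manager, which is essentially a Device Plugin responsible for creating vGPUs and communicating with the kubelet. If you are not familiar with Device Plugins, you can read this first.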

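Unlike Alibaba's GPUShare, GPU Manager returns a list of vGPUs to the kubelet from ListAndWatch rather than the actual GPU devices.
A GPU is virtualized along two resource dimensions, memory and computing resource:
- memory: allocated in units of 256 MB; each memory unit is called a vmemory device
- computing resource: one physical GPU is divided into 100 vprocessor devices, and each vprocessor holds 1% of the GPU's utilization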
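As a rough illustration of the two dimensions, the sketch below enumerates the virtual devices that one node would advertise. The device ID strings and the function name are hypothetical and chosen only for this example; the real GPU Manager is a separate codebase and may name and group devices differently.

```python
from typing import Dict, List

MEMORY_UNIT_MB = 256   # one vmemory device, as described above
CORES_PER_GPU = 100    # one vprocessor device is 1% of a physical GPU


def virtual_devices(gpu_count: int, memory_mb_per_gpu: int) -> Dict[str, List[str]]:
    """Return hypothetical virtual device IDs for a node.

    The ID format ("vmemory-N", "vprocessor-N") is an assumption made
    for illustration; it is not taken from GPU Manager.
    """
    units_per_gpu = memory_mb_per_gpu // MEMORY_UNIT_MB
    vmemory = [f"vmemory-{i}" for i in range(gpu_count * units_per_gpu)]
    vprocessor = [f"vprocessor-{i}" for i in range(gpu_count * CORES_PER_GPU)]
    return {"vmemory": vmemory, "vprocessor": vprocessor}


devices = virtual_devices(gpu_count=2, memory_mb_per_gpu=16384)
print(len(devices["vmemory"]), len(devices["vprocessor"]))
```

With two 16384 MB cards, each card contributes 64 vmemory devices and 100 vprocessor devices.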
A user requests a Pod with GPU resources using a manifest like this:
```
apiVersion: v1
kind: Pod
metadata:
  name: vcuda
spec:
  restartPolicy: Never
  hostNetwork: true
  containers:
  - image: tensorflow
    name: test-container
    command: ['/usr/local/nvidia/bin/nvidia-smi']
    resources:
      requests:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30
      limits:
        tencent.com/vcuda-core: 50
        tencent.com/vcuda-memory: 30
```
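tencent.com/vcuda-core and tencent.com/vcuda-memory are new resource names for sharing a single card. core is the utilization share, and one card has 100 cores; memory is GPU memory, in units of 256 MB. For example, to request 50% utilization and 7680 MB of GPU memory, set tencent.com/vcuda-core to 50 and tencent.com/vcuda-memory to 30. The original exclusive-card mode is still supported: set core to an integer multiple of 100 and memory to any value greater than 0.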
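The unit conversion can be sketched as a small helper. The function name is hypothetical, and two behaviors are assumptions of this sketch rather than documented rules: memory is rounded up to the next 256 MB unit, and a core value above 100 that is not a multiple of 100 is rejected.

```python
import math

MEMORY_UNIT_MB = 256  # one tencent.com/vcuda-memory unit


def to_vcuda_request(utilization_percent: int, memory_mb: int) -> dict:
    """Convert a utilization share and a memory size into resource values."""
    if utilization_percent <= 0 or memory_mb <= 0:
        raise ValueError("utilization and memory must be positive")
    # Assumption: anything above one card must be a whole number of cards.
    if utilization_percent > 100 and utilization_percent % 100 != 0:
        raise ValueError("requests above 100 must be a multiple of 100")
    return {
        "tencent.com/vcuda-core": utilization_percent,
        # Assumption: round up so the container never gets less than asked.
        "tencent.com/vcuda-memory": math.ceil(memory_mb / MEMORY_UNIT_MB),
    }


print(to_vcuda_request(50, 7680))
```

For the 50% and 7680 MB case this yields 50 and 30, matching the manifest.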
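After the user creates the Pod, the scheduler finds a suitable Node, and the kubelet then calls the Device Plugin's Allocate function. Because the kubelet only sees virtual devices, they must be mapped to an actual GPU device. That mapping is the job of the GPU Manager in the figure from the paper: it sends a request to the GPU Scheduler, which picks the most suitable GPU based on topology, and GPU Manager then returns an AllocateResponse to the kubelet.
The AllocateResponse includes: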
- Environment variables of the container
- The directories or files mounted into the container (e.g., NVIDIA Driver, CUDA libraries)
- The assigned devices.
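The mapping step can be sketched as follows. This is not GPU Manager's algorithm: the real GPU Scheduler selects by topology, while this sketch swaps in a simple best-fit rule (the fitting GPU with the fewest free cores) to keep the example short. The environment variable names are hypothetical, and the response is a plain dict rather than the real Device Plugin message type.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PhysicalGPU:
    index: int
    free_cores: int          # out of 100 per card
    free_memory_units: int   # 256 MB units


def pick_gpu(gpus: List[PhysicalGPU], cores: int, memory_units: int) -> Optional[PhysicalGPU]:
    """Best-fit stand-in for the topology-aware GPU Scheduler (assumption)."""
    fits = [g for g in gpus if g.free_cores >= cores and g.free_memory_units >= memory_units]
    if not fits:
        return None
    return min(fits, key=lambda g: (g.free_cores, g.free_memory_units))


def allocate(gpus: List[PhysicalGPU], cores: int, memory_units: int) -> Dict:
    """Map a virtual request to one physical GPU and build a response."""
    gpu = pick_gpu(gpus, cores, memory_units)
    if gpu is None:
        raise RuntimeError("no GPU can satisfy the request")
    gpu.free_cores -= cores
    gpu.free_memory_units -= memory_units
    return {
        # Hypothetical variable names, used only in this sketch.
        "envs": {
            "EXAMPLE_VCUDA_CORE_LIMIT": str(cores),
            "EXAMPLE_VCUDA_MEMORY_LIMIT_MB": str(memory_units * 256),
        },
        "mounts": ["/usr/local/nvidia"],
        "devices": [f"/dev/nvidia{gpu.index}"],
    }


gpus = [PhysicalGPU(0, 100, 64), PhysicalGPU(1, 60, 40)]
print(allocate(gpus, cores=50, memory_units=30)["devices"])
```

The three keys mirror the three items listed above: environment variables, mounted paths, and assigned devices.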

houminz commented 3 years ago

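Next is vGPU management, which corresponds to the vGPU Manager and vGPU Library in the paper. The vGPU Library is actually implemented as vcuda-controller.

vGPU Manager ended up as part of the GPU Manager project and runs on every Node as a DaemonSet. When a container requests GPU resources, the GPU Manager in the figure from the paper sends the container's configuration, such as the requested amount of GPU resources and the container name, to vGPU Manager. On receiving the configuration, vGPU Manager creates a unique directory on the host named after the container and puts that directory into the AllocateResponse, which is finally returned to the kubelet.

vGPU Manager and vGPU Library communicate in a server-client model. vGPU Manager keeps a list of the containers that use GPUs and are still alive. It checks this list periodically; when a container has been destroyed, it removes the container from the list and deletes its directory on the host.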
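The bookkeeping described here (one directory per container, plus a periodic sweep) can be sketched like this. The class name, the liveness callback, and the directory layout are hypothetical; the real vGPU Manager would ask the container runtime whether a container is still alive.

```python
import os
import shutil
import tempfile
from typing import Callable, Dict, List


class ContainerRegistry:
    """Tracks live GPU containers and their per-container host directories."""

    def __init__(self, base_dir: str, is_alive: Callable[[str], bool]):
        self.base_dir = base_dir
        self.is_alive = is_alive           # hypothetical liveness probe
        self.containers: Dict[str, str] = {}

    def register(self, name: str) -> str:
        """Create the directory named after the container and remember it."""
        path = os.path.join(self.base_dir, name)
        os.makedirs(path, exist_ok=True)
        self.containers[name] = path
        return path

    def sweep(self) -> List[str]:
        """Drop destroyed containers and delete their directories."""
        removed = []
        for name, path in list(self.containers.items()):
            if not self.is_alive(name):
                shutil.rmtree(path, ignore_errors=True)
                del self.containers[name]
                removed.append(name)
        return removed


alive = {"job-a", "job-b"}
registry = ContainerRegistry(tempfile.mkdtemp(), lambda name: name in alive)
registry.register("job-a")
registry.register("job-b")
alive.discard("job-b")      # simulate the container being destroyed
print(registry.sweep())
```

In a long-running daemon, `sweep` would be called from a timer loop; the interval is a tuning choice that this sketch leaves out.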

houminz commented 3 years ago

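Next is the most critical part, the implementation of the vCUDA Library, which enforces resource isolation by intercepting CUDA API calls. The specific APIs it intercepts are listed in a table in the paper.

The question here is: how does the vCUDA Library get injected into the container?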
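To make the interception idea concrete, here is a minimal sketch of the memory side only: a wrapper checks a quota before forwarding an allocation to the real allocator. The actual library is native code that hooks CUDA calls; this Python class, its names, and the exact accounting are assumptions made for illustration, and the compute-utilization side is not modeled.

```python
from typing import Callable, Dict


class OutOfMemory(Exception):
    """Raised by the wrapper when a request would exceed the quota."""


class QuotaAllocator:
    """Wraps an allocator and enforces a per-container memory limit."""

    def __init__(self, real_alloc: Callable[[int], int],
                 real_free: Callable[[int], None], limit_bytes: int):
        self.real_alloc = real_alloc
        self.real_free = real_free
        self.limit_bytes = limit_bytes
        self.used = 0
        self.sizes: Dict[int, int] = {}   # handle -> allocated size

    def alloc(self, size: int) -> int:
        # Reject before touching the real allocator, so the container
        # can never hold more than its quota.
        if self.used + size > self.limit_bytes:
            raise OutOfMemory(f"{size} bytes would exceed the quota")
        handle = self.real_alloc(size)
        self.sizes[handle] = size
        self.used += size
        return handle

    def free(self, handle: int) -> None:
        size = self.sizes.pop(handle)
        self.real_free(handle)
        self.used -= size


# A fake backend stands in for the real allocator in this sketch.
handles = iter(range(1, 1000))
quota = QuotaAllocator(lambda size: next(handles), lambda handle: None,
                       limit_bytes=30 * 256 * 1024 * 1024)
first = quota.alloc(4 * 1024 ** 3)
print(quota.used)
```

With a quota of 30 units of 256 MB, a 4 GiB allocation succeeds, and a second 4 GiB allocation would be rejected until the first is freed.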

               Host                     |                Container
                                        |
                                        |
 .-----------.                          |
 | allocator |----------.               |             ___________
 '-----------'   PodUID |               |             \          \
                        v               |              ) User App )--------.
               .-----------------.      |             /__________/         |
    .----------| virtual-manager |      |                                  |
    |          '-----------------'      |                                  |
$VirtualManagerPath/PodUID              |                                  |
    |                                   |       read /proc/self/cgroup     |
    |  .------------------.             |       to get PodUID, ContainerID |
    '->| create directory |------.      |                                  |
       '------------------'      |      |                                  |
                                 |      |                                  |
                .----------------'      |       .----------------------.   |
                |                       |       | fork call gpu-client |<--'
                |                       |       '----------------------'
                v                       |                   |
   .------------------------.           |                   |
  ( wait for client register )<-------PodUID, ContainerID---'
   '------------------------'           |
                |                       |
                v                       |
  .--------------------------.          |
  | locate pod and container |          |
  '--------------------------'          |
                |                       |
                v                       |
  .---------------------------.         |
  | write down configure and  |         |
  | pid file with containerID |         |
  | as name                   |         |
  '---------------------------'         |
                                        |
                                        |
                                        v
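The container-side step in the diagram, reading /proc/self/cgroup to obtain the PodUID and ContainerID, can be sketched as below. The regular expression assumes a cgroupfs-style path of the form `/kubepods/.../pod<uid>/<container-id>`; other layouts, such as the systemd cgroup driver, encode these IDs differently and are not handled here. The function name and the sample IDs are made up for the example.

```python
import re
from typing import Optional, Tuple

# Assumed layout: ".../pod<36-char UID>/<64-hex container ID>" at line end.
_CGROUP_RE = re.compile(
    r"/pod(?P<pod_uid>[0-9a-fA-F-]{36})/(?P<container_id>[0-9a-f]{64})\s*$"
)


def parse_cgroup(text: str) -> Optional[Tuple[str, str]]:
    """Return (pod_uid, container_id) from /proc/self/cgroup content."""
    for line in text.splitlines():
        match = _CGROUP_RE.search(line)
        if match:
            return match.group("pod_uid"), match.group("container_id")
    return None


sample = (
    "12:pids:/user.slice\n"
    "11:devices:/kubepods/besteffort/"
    "pod0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0/" + "ab" * 32 + "\n"
)
print(parse_cgroup(sample))
```

Inside a container, the input would come from `open("/proc/self/cgroup").read()`; the client would then send both IDs to the manager when it registers.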

houminz commented 3 years ago

I have summarized this paper analysis in a blog post.
