sortteam / HyKuFe


[Cloud] Install nvidia-docker on the Soma servers #14

Closed wonjong-yoo closed 5 years ago

wonjong-yoo commented 5 years ago

Task Goal

Done of Done (Subtask)

commit hash

Reference

[Link] https://docs.nvidia.com/datacenter/kubernetes/index.html

wonjong-yoo commented 5 years ago

Installing nvidia-docker2

nvidia-docker requires Docker 19.03 or later. Kubespray installed Docker 18.09, so remove it and reinstall the latest version.

    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

    $ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable edge"

    $ sudo systemctl stop docker
    $ sudo apt remove docker-ce -y
    $ apt list docker-ce -a

    $ sudo apt install -y docker-ce="5:19.03.2~3-0~ubuntu-bionic"
    $ sudo systemctl restart kubelet
    # Add the package repositories

    $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    $ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    $ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

    $ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    $ sudo apt-get update && sudo apt-get install -y nvidia-container-runtime
    $ sudo systemctl restart docker
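
Before moving on to the GPU tests, it is worth checking that the engine really came up on 19.03 and that the nvidia runtime got registered. A minimal verification sketch using standard docker/apt commands (nothing here is specific to this cluster):

    # Engine should now report 19.03.x
    $ docker version --format '{{.Server.Version}}'

    # "nvidia" should appear in the list of runtimes
    $ docker info | grep -i runtimes

    # The container toolkit / runtime packages should be installed
    $ apt list --installed 2>/dev/null | grep nvidia-container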

Test

    # Test nvidia-smi with the latest official CUDA image
    $ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi

    # Start a GPU enabled container on two GPUs
    $ docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi

    # Starting a GPU enabled container on specific GPUs
    $ docker run --gpus '"device=1,2"' nvidia/cuda:9.0-base nvidia-smi
    $ docker run --gpus '"device=UUID-ABCDEF,1"' nvidia/cuda:9.0-base nvidia-smi

    # Specifying a capability (graphics, compute, ...) for my container
    # Note this is rarely if ever used this way
    $ docker run --gpus all,capabilities=utility nvidia/cuda:9.0-base nvidia-smi

Node 1 test results

    Sat Oct  5 10:37:26 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A |
    |  0%   31C    P8    11W / 260W |    237MiB /  7979MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

Node 2 test results

    Sat Oct  5 10:39:47 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A |
    |  0%   32C    P8    14W / 260W |    252MiB /  7979MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

Node 3 test results

    Sat Oct  5 10:45:25 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce RTX 2080    Off  | 00000000:01:00.0 Off |                  N/A |
    |  0%   33C    P8    17W / 260W |     26MiB /  7979MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

NVIDIA/k8s-device-plugin

Nvidia Device Plugin For Kubernetes

The nvidia runtime must be set as the default runtime on each node. Edit /etc/docker/daemon.json as follows.

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
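
Docker has to be restarted for the new default runtime to take effect. A quick way to confirm it (standard docker commands, assuming the daemon.json above is in place):

    $ sudo systemctl restart docker

    # Should print "nvidia" once the default runtime has been applied
    $ docker info --format '{{.DefaultRuntime}}'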

Enabling GPU Support in Kubernetes

Deploy the following DaemonSet to Kubernetes.

    $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml
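
Once the DaemonSet is up, a plugin pod should be running on every GPU node and each node should start advertising the nvidia.com/gpu resource. A rough check with standard kubectl commands (exact pod names may differ):

    # One device-plugin pod per GPU node
    $ kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin

    # Nodes should now report nvidia.com/gpu capacity
    $ kubectl describe nodes | grep nvidia.com/gpu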

Running GPU Jobs

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:9.0-devel
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs
        - name: digits-container
          image: nvidia/digits:6.0
          resources:
            limits:
              nvidia.com/gpu: 2 # requesting 2 GPUs
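
To actually run it, save the manifest and apply it; the file name gpu-pod.yaml below is just an example. Note that each Soma node exposes a single RTX 2080, so the sample above (2 GPUs per container) would stay Pending on this cluster; lowering the limits to nvidia.com/gpu: 1 lets it schedule.

    # Apply the manifest (use nvidia.com/gpu: 1 on single-GPU nodes)
    $ kubectl apply -f gpu-pod.yaml

    # The pod should be scheduled onto a GPU node and reach Running
    $ kubectl get pod gpu-pod -o wide

    # nvidia-smi inside the container should list the allocated GPU
    $ kubectl exec gpu-pod -c cuda-container -- nvidia-smi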