Running GPU containers in Kubernetes #45

Preface

The NVIDIA Container Toolkit is a containerization solution for NVIDIA GPUs: it provides the runtime libraries and utilities needed to build and run GPU-accelerated containers with engines such as Docker or containerd.

The NVIDIA Container Toolkit is made up of several component packages; the architecture overview in the NVIDIA documentation describes the package hierarchy in detail.

Prerequisites

  1. To install containerd as the container engine on the system, first load a couple of prerequisite kernel modules:
    sudo modprobe overlay \
    && sudo modprobe br_netfilter
  2. You can also make these modules persist across reboots:
cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF
  3. If you plan to use containerd as the CRI runtime for Kubernetes, configure the required sysctl parameters:
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables  = 1
net.ipv4.ip_forward                 = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
  4. Then apply the parameters (a quick verification sketch follows this list):
    sudo sysctl --system
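
As a quick sanity check (a minimal sketch that only uses the module and parameter names set above), you can confirm that the modules are loaded and the sysctl values took effect:

# Both modules should appear if the modprobe calls succeeded
lsmod | grep -E 'overlay|br_netfilter'

# Each of these should report a value of 1 after `sysctl --system`
sysctl net.bridge.bridge-nf-call-iptables net.ipv4.ip_forward net.bridge.bridge-nf-call-ip6tables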

Install containerd

With the prerequisites in place, we can proceed to install containerd for the Linux distribution, setting up the Docker repository as described here. An Ubuntu system is used in this guide.

  1. Install the packages that allow apt to use a repository over HTTPS:

    sudo apt-get update
    sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release
  2. Add the repository GPG key and the repository itself:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
  3. Install the containerd package:

    sudo apt-get update \
    && sudo apt-get install -y containerd.io
  4. Configure containerd with the default config.toml configuration file:

sudo mkdir -p /etc/containerd \
    && sudo containerd config default | sudo tee /etc/containerd/config.toml

Additional configuration is needed in order to use the NVIDIA Container Runtime. The following options should be added to register nvidia as a runtime and to use systemd as the cgroup driver; a patch is provided below:

cat <<EOF > containerd-config.patch
--- config.toml.orig    2020-12-18 18:21:41.884984894 +0000
+++ /etc/containerd/config.toml 2020-12-18 18:23:38.137796223 +0000
@@ -94,6 +94,15 @@
        privileged_without_host_devices = false
        base_runtime_spec = ""
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
+            SystemdCgroup = true
+       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
+          privileged_without_host_devices = false
+          runtime_engine = ""
+          runtime_root = ""
+          runtime_type = "io.containerd.runc.v1"
+          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
+            BinaryName = "/usr/bin/nvidia-container-runtime"
+            SystemdCgroup = true
    [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"
EOF
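
The heredoc above only writes the patch to containerd-config.patch; it still has to be applied to /etc/containerd/config.toml. One way to do this is with the standard patch utility (a minimal sketch; the hunk offsets assume a file freshly generated by containerd config default as in step 4, so if the patch does not apply cleanly you can simply add the nvidia runtime block by hand):

# Apply the patch; -b keeps a backup copy of the original file
sudo patch -b /etc/containerd/config.toml containerd-config.patch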
  5. After applying the configuration patch, restart containerd:

    sudo systemctl restart containerd
  6. You can test the installation by running the Docker hello-world container with the ctr tool (a quick configuration check is sketched after the sample output):

sudo ctr image pull docker.io/library/hello-world:latest \
    && sudo ctr run --rm -t docker.io/library/hello-world:latest hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/
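
Before moving on, it is also worth confirming that containerd is running and that the nvidia runtime block from the patch actually made it into the configuration (a quick check, assuming the file paths used above):

# containerd should be active after the restart
sudo systemctl is-active containerd

# The patched config should now contain an nvidia runtime entry
grep -n 'runtimes.nvidia' /etc/containerd/config.toml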

Install the NVIDIA Container Toolkit

With containerd installed, we can move on to the NVIDIA Container Toolkit. For containerd we need the nvidia-container-toolkit package; see the architecture overview for more details on the package hierarchy.

  1. First, set up the package repository and the GPG key:
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  2. Install the NVIDIA Container Toolkit:
sudo apt-get update \
    && sudo apt-get install -y nvidia-container-toolkit

For NVIDIA Container Toolkit versions prior to 1.6.0, the nvidia-docker repository should be used instead and the nvidia-container-runtime package installed. In that case the package repository is set up as follows:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  3. The installed packages can be confirmed by running the following command (see also the runtime check sketched after this list):
    sudo apt list --installed "*nvidia*"
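
Since the containerd patch points at BinaryName = "/usr/bin/nvidia-container-runtime", it is also worth checking that the runtime binary and the underlying library CLI are actually present (a minimal sketch; nvidia-container-cli ships with the toolkit's libnvidia-container dependency):

# The runtime binary referenced from config.toml should exist
command -v nvidia-container-runtime

# nvidia-container-cli should be able to see the driver and GPUs on the host
sudo nvidia-container-cli info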

Testing the installation

  1. We can now run a test GPU container with ctr (a Kubernetes-level example follows the sample output below):
sudo ctr image pull docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04
sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
    cuda-11.6.2-base-ubuntu20.04 nvidia-smi

You should see output similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
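
To run a GPU container from Kubernetes itself rather than through ctr, the cluster additionally needs the NVIDIA device plugin so that nvidia.com/gpu resources are advertised to the scheduler. The following is only a rough sketch under those assumptions (a cluster using this containerd instance as its CRI runtime, the device plugin deployed, and nvidia selected as the default runtime or via a RuntimeClass); the pod name is illustrative:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smi-test              # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # requires the NVIDIA device plugin
EOF

# Once the pod completes, its logs should show the same nvidia-smi table as the ctr test above
kubectl logs cuda-smi-test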

References: