llajas / homelab

Modern self-hosting infrastructure, fully automated from empty disk to operating services with a single command.
https://homelab.lajas.tech

Automate Nvidia Driver Installation #14

Open · llajas opened 8 months ago

llajas commented 8 months ago

This project currently runs on Fedora, which is not an operating system supported by the Nvidia gpu-operator Helm chart. There are two major issues in conflict with the setup here:

  1. The GPU driver DaemonSet portion of the Helm chart appends the OS version it detects on the node to the driver image tag it pulls, so the image pull fails on any operating system that is not pre-approved (see the sketch after this list).
  2. While the deployed object can be adjusted in-cluster to pull an image built for a sister OS (RHEL/CoreOS), the init process then checks that the node OS matches the pulled image, causing yet another failure, as the log below shows.
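
Roughly what the first issue looks like from the registry side; the tag format and tag names here are assumptions for illustration (skopeo is only used to probe the registry), not verified pulls:

# The driver DaemonSet derives its image tag as roughly <driver-version>-<os-id><os-version>,
# so on this Fedora 38 node it ends up looking for a tag that is never published:
skopeo inspect docker://nvcr.io/nvidia/driver:535.154.05-fedora38   # fails: no such tag
# whereas tags only exist for supported distros (exact tag name assumed for illustration):
skopeo inspect docker://nvcr.io/nvidia/driver:535.154.05-rhel8.9
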
❯ kc -n gpu-operator logs nvidia-driver-daemonset-nl7dg
DRIVER_ARCH is x86_64
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=535.154.05
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ OPEN_KERNEL_MODULES_ENABLED=false
+ [[ false == \t\r\u\e ]]
+ KERNEL_TYPE=kernel
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
++ GDRCOPY_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=6.7.4-100.fc38.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ echo 'Resolving RHEL version...'
+ local version=
Resolving RHEL version...
++ cat /host-etc/os-release
++ grep '^ID='
++ awk -F= '{print $2}'
++ sed -e 's/^"//' -e 's/"$//'
+ local id=fedora
+ '[' fedora = rhcos ']'
+ '[' fedora = rhel ']'
+ '[' -z '' ']'
+ echo 'Could not resolve RHEL version'
Could not resolve RHEL version
+ return 1
+ exit 1

As a result, it is not feasible to use the driver component of the Nvidia gpu-operator.

The workaround is to install the drivers manually on the node using the following commands:

sudo dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm
sudo dnf install https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
sudo dnf update -y
sudo dnf install akmod-nvidia
sudo dnf install xorg-x11-drv-nvidia-cuda
sudo reboot
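
One note on akmod-nvidia: it builds the kernel module asynchronously after the install, so it is worth confirming the build has finished before rebooting, e.g.:

modinfo -F version nvidia   # prints the driver version once the akmod build has completed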

Once installed, this allows us to leverage the GPUs on that node:

[root@metal6 ~]# nvidia-smi
Mon Feb 12 00:25:27 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:04:00.0 Off |                  N/A |
| 31%   29C    P8               9W / 160W |      1MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Implementation

The automation should be adjusted to:

  1. Detect relevant hardware on a node, namely Nvidia GPUs
  2. Run the (currently manual) installation process above when a GPU is detected

This can be done via Ansible at node setup time, or by leveraging in-cluster tooling such as Node Feature Discovery (NFD) once the cluster has been established (via a SyncWave?).

An Ansible approach would detect the hardware during provisioning, before the k3s setup. An in-cluster approach would leverage NFD and then use a DaemonSet with node selectors and tolerations so that it only runs on hosts carrying the Nvidia label.
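
A rough sketch of the detection itself in shell (what either approach would key on); the NFD label key is an assumption based on NVIDIA's PCI vendor ID 0x10de and may differ depending on how NFD is configured:

# Node-side check (roughly what an Ansible task could run before the k3s setup)
if lspci -nn | grep -Ei 'vga|3d controller' | grep -qi nvidia; then
  echo "NVIDIA GPU detected on $(hostname); run the driver installation steps"
fi

# Cluster-side check: list nodes that NFD has labelled with NVIDIA's PCI vendor ID.
# Label key assumed; depending on NFD configuration it may include the device class,
# e.g. feature.node.kubernetes.io/pci-0300_10de.present=true
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name
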

llajas commented 8 months ago

Could also look into using a custom driver container image.

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#running-a-custom-driver-image
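
A rough sketch of what that might look like, based on the linked docs; the Fedora build context, registry name, and resulting tag are assumptions and have not been verified:

# Build a driver container for the node's OS from NVIDIA's driver image sources
git clone https://gitlab.com/nvidia/container-images/driver && cd driver
docker build \
  --build-arg DRIVER_VERSION=535.154.05 \
  -t registry.example.com/nvidia/driver:535.154.05-fedora38 \
  fedora38   # hypothetical build context; a Fedora Dockerfile would need to be added and maintained

# Point the operator at the custom image via chart values
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --set driver.repository=registry.example.com/nvidia \
  --set driver.version=535.154.05
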

llajas commented 8 months ago

More information on this: the gpu-operator has now been deployed successfully. This was contingent on quite a few customizations applied uniquely to metal6, creating considerable drift from the other node configurations. These included:

  1. Installing the RPMFusion drivers.
  2. Installing the NVIDIA container toolkit.
  3. Disabling both of the above components within the gpu-operator Helm chart (since they are installed manually, there is no need for the operator to manage these two pieces, and attempts at doing so were not fruitful; see the values sketch below).
  4. Creating necessary symlinks (e.g., for ldconfig): ln -s /sbin/ldconfig /sbin/ldconfig.real
  5. Any other custom configuration required for GPU nodes.
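
A sketch of what disabling those two components in the chart could look like, assuming the upstream gpu-operator values driver.enabled and toolkit.enabled (release and namespace names here are illustrative):

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false
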

With this in mind, it would make sense to add some sort of flag that can be set at cluster init (i.e. make configure) asking whether Nvidia GPUs will be used, and then apply the configuration above to the nodes that actually have an Nvidia GPU, e.g. metal6:

[root@metal6 ~]# lspci | grep -i nvidia
04:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2060] (rev a1)
05:00.0 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
06:00.0 Serial bus controller: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)

More to come...

llajas commented 8 months ago

Commands for our Unraid-based VM with GPU pass-through:

# Install qemu tools
sudo dnf update -y
sudo dnf install qemu-guest-agent
sudo systemctl enable --now qemu-guest-agent

sudo systemctl status qemu-guest-agent  # To verify QEMU tools are running
lsmod | grep virtio  # Checking if virtio modules are loaded
modprobe virtio_serial  # Ensuring virtio serial is loaded, might be needed for QEMU agent communication?

lspci  # Likely checking for hardware details; not directly related to VM visibility in Unraid
sudo dnf install pciutils  # Installing utilities for hardware introspection

# RPM Fusion Repository Setup
sudo dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm
sudo dnf install https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

sudo dnf install akmod-nvidia  # Installing NVIDIA drivers
sudo dnf install xorg-x11-drv-nvidia-cuda  # Installing the CUDA toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)    
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo dnf install nvidia-container-toolkit

# might not be necessary
export PATH=/usr/local/cuda/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}

# Critical Symlink for ldconfig 
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
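
A few quick sanity checks to confirm the pieces above are in place (assuming the nvidia-ctk binary ships with the container toolkit package, as it does on current builds):

nvidia-smi                  # driver is loaded and sees the passed-through GPU
nvidia-ctk --version        # container toolkit binary is installed
ls -l /sbin/ldconfig.real   # the ldconfig symlink from the step above exists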