llajas opened 8 months ago
Could also look into using a custom driver container image.
More information on this: the `gpu-operator` has been deployed successfully. This was contingent on quite a few customizations applied uniquely to metal6, creating quite a bit of drift from other node configurations. These included:
- Installing the RPM Fusion drivers.
- Installing the NVIDIA Container Toolkit.
- Disabling both of the above components in the `gpu-operator` helm chart; since they are installed manually there is no need for the operator to manage them, and attempts at doing so were not fruitful (see the helm sketch after this list).
- Creating necessary symlinks (e.g., for `ldconfig`): `ln -s /sbin/ldconfig /sbin/ldconfig.real`
- Any other custom configuration required for GPU nodes.
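For reference, disabling those two components at install time looks roughly like this; `driver.enabled` and `toolkit.enabled` are the documented chart values for the driver and container toolkit components:

```bash
# Sketch: deploy gpu-operator without its driver and toolkit components,
# since both are installed manually on the host.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```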
With this in mind, it would make sense to add some sort of flag that can be set at cluster init (i.e. `make configure`) asking whether NVIDIA GPUs will be used, and then apply the configuration above to the nodes that have an NVIDIA GPU:
```console
[root@metal6 ~]# lspci | grep -i nvidia
04:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2060] (rev a1)
05:00.0 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
06:00.0 Serial bus controller: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
```
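A minimal sketch of what that gate could look like during node setup; `setup_gpu_node` is a hypothetical placeholder for the steps listed above:

```bash
# Hypothetical sketch: only apply GPU configuration when an NVIDIA device
# (PCI vendor 10de) is present on the node.
if lspci -d 10de: | grep -q .; then
    echo "NVIDIA GPU detected; applying GPU node configuration"
    setup_gpu_node  # placeholder for the RPM Fusion / toolkit / symlink steps
fi
```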
More to come...
Commands for our Unraid-based VM with GPU pass-through:
```bash
# Install QEMU guest tools
sudo dnf update -y
sudo dnf install qemu-guest-agent
sudo systemctl enable --now qemu-guest-agent
sudo systemctl status qemu-guest-agent  # Verify the QEMU guest agent is running
lsmod | grep virtio                     # Check whether the virtio modules are loaded
modprobe virtio_serial                  # Ensure virtio-serial is loaded; may be needed for QEMU agent communication
lspci                                   # Inspect hardware details
sudo dnf install pciutils               # Utilities for hardware introspection (provides lspci)
```
```bash
# RPM Fusion repository setup
sudo dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm
sudo dnf install https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
sudo dnf install akmod-nvidia             # NVIDIA drivers, built as kernel modules by akmods
sudo dnf install xorg-x11-drv-nvidia-cuda # CUDA support
```
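The akmod build runs in the background and can take a few minutes; a quick way to confirm the module is ready before rebooting (the usual RPM Fusion check):

```bash
# Prints the driver version once akmods has built the module for the
# running kernel; reboot afterwards so the module is loaded.
modinfo -F version nvidia
sudo reboot
# After the reboot, nvidia-smi should list the GPU.
nvidia-smi
```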
```bash
# NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo dnf install nvidia-container-toolkit
```
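On a node running standalone containerd, the toolkit ships a helper to register the NVIDIA runtime; note that k3s bundles its own containerd and detects the runtime on startup, so this step may not apply here:

```bash
# For standalone containerd only; k3s generates its own containerd config.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```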
```bash
# Might not be necessary
export PATH=/usr/local/cuda/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
```
```bash
# Critical symlink for ldconfig
ln -s /sbin/ldconfig /sbin/ldconfig.real
```
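The symlink works around the operator's assumption that `/sbin/ldconfig.real` exists, which is an Ubuntu convention Fedora lacks. A quick sanity check that the link and the NVIDIA libraries resolve:

```bash
# Confirm the symlink resolves and NVIDIA libraries are in the ldconfig cache.
ls -l /sbin/ldconfig.real
ldconfig -p | grep -i nvidia
```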
This project currently runs on Fedora, which is not a supported OS for the NVIDIA `gpu-operator` helm chart. There are two major issues in conflict with the setup here; as a result, it is not feasible to use the `driver` component of the NVIDIA `gpu-operator`. The workaround is to install the drivers manually on the node using the commands above. Once installed, this allows us to leverage GPUs on that node.
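As a quick end-to-end check, a throwaway pod can request a GPU and run `nvidia-smi`; the pod name and image tag below are illustrative:

```bash
# Smoke test: schedule a pod that requests one GPU and prints nvidia-smi output.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Once the pod has completed:
kubectl logs gpu-smoke-test
```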
Implementation
The automation should be adjusted to apply the GPU node configuration described above.
This can be done via Ansible at node setup, or by leveraging mechanisms within the cluster, such as NFD, after the cluster has been established (via SyncWave?). An Ansible approach would detect hardware during setup, before k3s is installed. An in-cluster approach would leverage NFD and then use a DaemonSet with tolerations and a node selector so it only runs on hosts that match the NVIDIA label.
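For the in-cluster route, NFD publishes PCI devices as node labels. The exact key depends on NFD's `deviceLabelFields` configuration; with the defaults it includes the device class and vendor, e.g. `pci-0300_10de.present`, where `10de` is NVIDIA's vendor ID. A hypothetical selector query:

```bash
# List nodes NFD has labeled with an NVIDIA PCI device; a DaemonSet
# nodeSelector would use the same label key/value.
kubectl get nodes -l feature.node.kubernetes.io/pci-0300_10de.present=true
```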