rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

https://docs.rapids.ai/deployment/nightly/cloud/azure/azure-vm #378

Closed jacobtomlinson closed 1 month ago

jacobtomlinson commented 1 month ago

Following the instructions up to the point of SSHing into the VM, I am noticing that the NVIDIA driver install fails but still reports "Install complete".

Welcome to the NVIDIA GPU Cloud image.  This image provides an optimized
environment for running the deep learning and HPC containers from the
NVIDIA GPU Cloud Container Registry.  Many NGC containers are freely
available.  However, some NGC containers require that you log in with
a valid NGC API key in order to access them.  This is indicated by a
"pull access denied for xyz ..." or "Get xyz: unauthorized: ..." error
message from the daemon.

Documentation on using this image and accessing the NVIDIA GPU Cloud
Container Registry can be found at
  http://docs.nvidia.com/ngc/index.html

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Installing drivers ...

modprobe: FATAL: Module nvidia not found in directory /lib/modules/6.2.0-1011-azure
Install complete

If I check for nvidia-smi, it can't be found.

$ nvidia-smi
Command 'nvidia-smi' not found, but can be installed with:
...
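
Since the login banner prints "Install complete" even when the driver is missing, the check above could be scripted. Here is a minimal sketch in Python, assuming only that a working driver puts `nvidia-smi` on the PATH and that it exits with code 0 when the driver responds. The function name `nvidia_driver_ready` is hypothetical and not part of the image or the RAPIDS docs.

```python
import shutil
import subprocess


def nvidia_driver_ready(which=shutil.which, run=subprocess.run):
    """Return (ok, reason) describing whether nvidia-smi exists and runs.

    `which` and `run` are injectable so the check can be exercised
    on a machine without a GPU.
    """
    # Step 1: is the nvidia-smi binary on the PATH at all?
    smi = which("nvidia-smi")
    if smi is None:
        return False, "nvidia-smi not found on PATH"

    # Step 2: does it run and exit cleanly (i.e. can it talk to the driver)?
    try:
        result = run([smi], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"nvidia-smi could not be run: {exc}"

    if result.returncode != 0:
        return False, f"nvidia-smi exited with code {result.returncode}"
    return True, "driver responding"
```

Calling `nvidia_driver_ready()` right after SSHing in and before pulling the RAPIDS container would surface the failure explicitly, rather than relying on the banner text.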

This is out of scope for our testing, but I am unable to proceed with running the RAPIDS container because the drivers are missing. I could manually install the driver and the container toolkit, but users shouldn't be expected to handle this failure case. I'm going to find out where we can report this.

jacobtomlinson commented 1 month ago

I deleted the VM and created it again with exactly the same configuration, and this time it worked. Looks like it was a transient error. I've fed back to the team that maintains the image that a more helpful error message would be a nice improvement.