rocker-org / ml

experimental machine learning container
GNU General Public License v2.0

Notes on upgrading cuda version on host #28

Open cboettig opened 4 years ago

cboettig commented 4 years ago

After upgrading the NVIDIA drivers on the host (e.g. with apt-get upgrade), NVIDIA tools will fail to run due to a driver/library mismatch, e.g.:

 $ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Rebooting the machine resolves this, but if that is not a convenient option for a server, the driver can be reloaded manually: stop any running Xorg instances (e.g. sudo service gdm3 stop), then unload the NVIDIA kernel module with sudo rmmod nvidia. The latter may fail and list submodules that are still loaded, so unload those first as well (e.g. sudo rmmod nvidia-uvm). Then run sudo nvidia-smi to reload the driver and confirm the GPU is back up and running; a command sketch follows below.
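
A minimal sketch of that sequence, assuming gdm3 is the display manager and nvidia-uvm is the only submodule still holding the driver (check lsmod for others such as nvidia-drm or nvidia-modeset and unload them the same way):

    # stop the display manager so Xorg releases the GPU
    sudo service gdm3 stop

    # see which nvidia modules are still loaded
    lsmod | grep nvidia

    # unload submodules first, then the main module
    sudo rmmod nvidia-uvm
    sudo rmmod nvidia

    # nvidia-smi reloads the driver; the output should now show the new version
    sudo nvidia-smi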

see:

Running nvidia-docker instances, e.g. with docker run --gpus all ..., should now work again as before; a quick check is sketched below. We should add this to the user docs when we get around to writing down more about the CUDA images...
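
A quick way to confirm the GPU is visible from inside a container again, assuming the NVIDIA Container Toolkit is set up and a CUDA-enabled image such as rocker/ml is available locally (any image that can run nvidia-smi would do):

    # run nvidia-smi in a throwaway container; it should report the upgraded driver
    docker run --rm --gpus all rocker/ml nvidia-smi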