rocker-org / ml

experimental machine learning container
GNU General Public License v2.0

Notes on upgrading cuda version on host #28

Open cboettig opened 4 years ago

cboettig commented 4 years ago

After upgrading the NVIDIA drivers on the host (e.g. with apt-get upgrade), NVIDIA tools will fail to run due to a driver/library mismatch, e.g.:

 $ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Rebooting the machine resolves this, but if that is not a convenient option for a server, the driver can be reloaded manually: stop any running Xorg instances (e.g. sudo service gdm3 stop), then unload the NVIDIA kernel module with sudo rmmod nvidia. The latter may fail and list submodules that are still loaded, so unload those first as well (e.g. sudo rmmod nvidia-uvm). Then run sudo nvidia-smi to reload the driver and confirm the GPU is back up and running; a command sketch follows below.
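
A minimal sketch of that sequence, assuming gdm3 is the display manager and nvidia-uvm is the only submodule still holding the driver (check lsmod for others such as nvidia-drm or nvidia-modeset and unload them the same way):

    # stop the display manager so Xorg releases the GPU
    sudo service gdm3 stop

    # see which nvidia modules are still loaded
    lsmod | grep nvidia

    # unload submodules first, then the main module
    sudo rmmod nvidia-uvm
    sudo rmmod nvidia

    # nvidia-smi reloads the driver; the output should now show the new version
    sudo nvidia-smi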

see:

Running nvidia-docker instances, e.g. with docker run --gpus all ..., should now work again as before; a quick check is sketched below. We should add this to the user docs when we get around to writing down more about the CUDA images...
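
A quick way to confirm the GPU is visible from inside a container again, assuming the NVIDIA Container Toolkit is set up and a CUDA-enabled image such as rocker/ml is available locally (any image that can run nvidia-smi would do):

    # run nvidia-smi in a throwaway container; it should report the upgraded driver
    docker run --rm --gpus all rocker/ml nvidia-smi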