After upgrading the nvidia drivers on the host (e.g. with apt-get upgrade), nvidia tasks will fail to run due to a driver/library version mismatch, e.g.:
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
Rebooting the machine resolves this, but if a reboot is not a convenient option for a server, the driver can be reloaded manually: stop any running Xorg instances (e.g. sudo service gdm3 stop) and then unload the kernel module with sudo rmmod nvidia. The latter may fail and list submodules that are still loaded, so remove those first as well, e.g. sudo rmmod nvidia-uvm. Then run sudo nvidia-smi, which reloads the driver, and confirm the GPU is back up and running.
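For reference, the whole sequence looks roughly like this on a typical setup (just a sketch: the display manager may be lightdm rather than gdm3, and nvidia_drm / nvidia_modeset may or may not be loaded on a given machine, so check lsmod first):
$ sudo service gdm3 stop
$ lsmod | grep nvidia                              # see which nvidia modules are actually loaded
$ sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm  # remove dependent submodules first (only those that are loaded)
$ sudo rmmod nvidia
$ sudo nvidia-smi                                  # reloads the driver; should list the GPUs again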
Running nvidia-docker instances, e.g. with docker run --gpus all ..., should now work again as before. Should add this to the user docs when we get to writing down more stuff about CUDA images...
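A quick way to double-check from inside a container (the image tag below is only an example, any CUDA base image we already have around will do):
$ docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi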