Hi, I run 3 machines with Pop!_OS and Nvidia GPUs.
They are mainly used as Docker hosts for AI experiments in nvidia/cuda-based containers.
Recently, all of these machines started repeatedly losing the ability to work with their GPUs: nvidia-smi reports an NVML driver/library version mismatch, PyTorch fails on CUDA placements, and so on. I've fixed the problem each time by removing and reinstalling the Nvidia drivers, but it comes back 4-5 days after the fix, and the same behaviour occurs on all 3 machines. To clarify: I do not run any updates or install any new packages on the host OS. The drivers just seem to stop working for no apparent reason.
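For reference, here is a minimal Python sketch of how the mismatch can be confirmed without relying on nvidia-smi (which errors out entirely once the versions diverge). The library path is an assumption based on the Debian/Ubuntu multiarch layout that Pop!_OS follows; adjust it if your libraries live elsewhere:

```python
import glob
from pathlib import Path

# Version of the currently loaded nvidia kernel module.
kernel_ver = Path("/sys/module/nvidia/version").read_text().strip()

# Versions of the userspace NVML libraries installed on disk; the full
# driver version is encoded in the shared object's file name.
# NOTE: this path assumes the Debian/Ubuntu multiarch layout on Pop!_OS.
libs = glob.glob("/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*.*")
lib_vers = {p.rsplit(".so.", 1)[-1] for p in libs}

print(f"kernel module version: {kernel_ver}")
print(f"NVML library versions: {sorted(lib_vers)}")
if kernel_ver not in lib_vers:
    print("MISMATCH: userspace libraries no longer match the loaded module")
```

If this shows the on-disk library version drifting away from the loaded module even though nothing was installed manually, it at least narrows the problem down to the userspace driver packages changing underneath the running kernel module.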
Has anyone experienced similar problems?