maciej-sypetkowski / kaggle-rcic-1st

1st Place Solution for Kaggle Recursion Cellular Image Classification Challenge -- https://www.kaggle.com/c/recursion-cellular-image-classification/
MIT License

Found no NVIDIA driver on your system #5

Closed WurmD closed 4 years ago

WurmD commented 4 years ago

Hello,

After building the image and running it:

sudo docker build --tag testimage .
sudo docker run -t -i --privileged testimage bash
cd rcic/
python main.py --save testrun

We get:

    Traceback (most recent call last):
      File "main.py", line 504, in <module>
        main(args)
      File "main.py", line 484, in main
        model = ModelAndLoss(args).cuda()
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 297, in cuda
        return self._apply(lambda t: t.cuda(device))
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 194, in _apply
        module._apply(fn)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 194, in _apply
        module._apply(fn)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 194, in _apply
        module._apply(fn)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in _apply
        param_applied = fn(param)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 297, in <lambda>
        return self._apply(lambda t: t.cuda(device))
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 178, in _lazy_init
        _check_driver()
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 99, in _check_driver
        http://www.nvidia.com/Download/index.aspx""")
    AssertionError: 
    Found no NVIDIA driver on your system. Please check that you
    have an NVIDIA GPU and installed a driver from
    http://www.nvidia.com/Download/index.aspx

Note that outside Docker, the GPU works as intended:

$ nvidia-smi
Sat Sep 26 17:26:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
| 33%   41C    P8    11W / 180W |      1MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
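
For completeness, the failing check can be reproduced directly. Inside the container, the following one-liner (a quick diagnostic, not part of the repository) prints False, while on the host it prints True:

    python -c "import torch; print(torch.cuda.is_available())"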

Am I not running your code as intended? What steps did you take to run the code in Docker on your machine?

maciej-sypetkowski commented 4 years ago

Add --gpus=all to your docker run command, or change docker run to nvidia-docker run. By default, Docker doesn't pass GPUs to the container. After this change, you should be able to run nvidia-smi inside the container and get the same output as outside it.
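
For example, with the commands from the first post, the run step would become (a sketch using the same testimage tag):

    sudo docker run -t -i --gpus=all --privileged testimage bash
    nvidia-smi    # should now list the host GPU inside the container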

WurmD commented 4 years ago

I confirm that installing nvidia-container-toolkit as per https://stackoverflow.com/a/58432877/1734357 and then adding --gpus=all to the docker run command resolves it.
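
For anyone else hitting this on Ubuntu, the installation from that answer roughly amounts to the following (a sketch assuming the NVIDIA container repository has already been added as described in the linked answer):

    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker    # restart the Docker daemon so it picks up the toolkit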