RuntimeError: CUDA error: no kernel image is available for execution on the device

ammarasmro commented 3 years ago

System:

WSL2

GPU: 3080

 python training/run_experiment.py --model_class=MLP --data_class=MNIST --max_epochs=5 --gpus=-1

I followed the documented steps but ended up with this error:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Complete output

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/cuda/__init__.py:104: UserWarning:
GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the GeForce RTX 3080 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

  | Name          | Type     | Params
-------------------------------------------
0 | model         | MLP      | 936 K
1 | model.dropout | Dropout  | 0
2 | model.fc1     | Linear   | 803 K
3 | model.fc2     | Linear   | 131 K
4 | model.fc3     | Linear   | 1.3 K
5 | train_acc     | Accuracy | 0
6 | val_acc       | Accuracy | 0
7 | test_acc      | Accuracy | 0
-------------------------------------------
936 K     Trainable params
0         Non-trainable params
936 K     Total params
/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "training/run_experiment.py", line 90, in <module>
    main()
  File "training/run_experiment.py", line 85, in main
    trainer.fit(lit_model, datamodule=data)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
    results = self.accelerator_backend.train()
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 66, in train
    results = self.train_or_test()
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 495, in train
    self.run_sanity_check(self.get_model())
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 693, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 609, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 178, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 84, in validation_step
    return self._step(self.trainer.model.validation_step, args)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 76, in _step
    output = model_step(*args)
  File "/mnt/c/Users/user/GitHub/fsdl-text-recognizer-2021-labs/lab1/text_recognizer/lit_models/base.py", line 58, in validation_step
    logits = self(x)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/c/Users/user/GitHub/fsdl-text-recognizer-2021-labs/lab1/text_recognizer/lit_models/base.py", line 45, in forward
    return self.model(x)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/c/Users/user/GitHub/fsdl-text-recognizer-2021-labs/lab1/text_recognizer/models/mlp.py", line 37, in forward
    x = self.fc1(x)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: no kernel image is available for execution on the device

ammarasmro commented 3 years ago

I'm currently getting around it as follows:

1. Change the cuda version in environment.yml.
2. Remove the cudnn line from environment.yml.
3. After setting the labs up, run this command: conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

With that, lab1 passes. I'm not sure it completely solves the problem, though.
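One way to check whether a reinstall addressed the mismatch reported in the warning is to compare the GPU's compute capability against the architectures the installed PyTorch build was compiled for. The sketch below is a minimal illustration, not part of the labs: `build_supports_device` is a hypothetical helper, and it assumes a simplified compatibility rule (a binary built for `sm_XY` runs on a device with the same major version and an equal or higher minor version; PTX entries such as `compute_80` are ignored). The live check uses `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`, which may not exist in very old PyTorch releases.

```python
import re


def build_supports_device(capability, arch_list):
    """Return True if some sm_XY entry in arch_list can run on the device.

    Simplified rule (an assumption of this sketch): same major version and
    a minor version no higher than the device's.
    """
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
        if not match:
            continue  # skip non-binary entries such as "compute_80"
        sm = int(match.group(1))
        if sm // 10 == major and sm % 10 <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with a visible CUDA device is needed for the live check.")
    else:
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("build arch list:", archs)
        print("supported:", build_supports_device(cap, archs))
```

With the values from the warning above, `build_supports_device((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"])` returns `False`, which matches the reported error. A build whose list includes an `sm_80` or `sm_86` entry would return `True` under this rule.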

tranhoangkhuongvn commented 3 years ago

I modified the steps as follows to make it work on my RTX 3090 + Ubuntu 20:

1. Remove both the cuda and cudnn versions in environment.yml.
2. After setting the labs up via make conda-update, run conda install -c anaconda cudatoolkit
3. Finally, run conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

sunki-hong commented 3 years ago

RTX3070 + Ubuntu 18.04

(if activated) conda deactivate
conda env remove -n fsdl-text-recognizer-2021

Remove both the cuda and cudnn versions in environment.yml, as tranhoangkhuongvn mentioned.

environment.yml will then look like this:

name: fsdl-text-recognizer-2021
channels:
  - defaults
dependencies:
  - python=3.6  # Google Colab is still on Python 3.6
  - pip
  - pip:
    - pip-tools

make conda-update
conda activate fsdl-text-recognizer-2021
make pip-tools
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
(install command taken from https://pytorch.org/get-started/locally/)
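After reinstalling, a quick smoke test can confirm that a kernel actually launches on the GPU, since the original failure happened in the first `Linear` layer's forward pass. The following is a minimal sketch under that assumption; `gpu_smoke_test` is a hypothetical helper, not part of the labs, and it returns `None` when PyTorch or a CUDA device is not available.

```python
def gpu_smoke_test():
    """Run a tiny linear layer on the GPU.

    Returns True if the forward pass succeeds, False if it raises a
    RuntimeError (for example "no kernel image is available"), and None
    if PyTorch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    try:
        layer = torch.nn.Linear(4, 2).to("cuda")
        out = layer(torch.randn(3, 4, device="cuda"))
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return tuple(out.shape) == (3, 2)
    except RuntimeError as err:
        print(f"GPU smoke test failed: {err}")
        return False


if __name__ == "__main__":
    print("GPU smoke test result:", gpu_smoke_test())
```

Running this inside the activated fsdl-text-recognizer-2021 environment exercises the same kind of operation that failed in the traceback, without needing to launch a full training run.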

the-full-stack / fsdl-text-recognizer-2021-labs

RuntimeError: CUDA error: no kernel image is available for execution on the device #12