the-full-stack / fsdl-text-recognizer-2021-labs

Complete deep learning project developed in Full Stack Deep Learning, Spring 2021
https://bit.ly/berkeleyfsdl
MIT License

RuntimeError: CUDA error: no kernel image is available for execution on the device #12

Open ammarasmro opened 3 years ago

ammarasmro commented 3 years ago

System:

GPU: GeForce RTX 3080

 python training/run_experiment.py --model_class=MLP --data_class=MNIST --max_epochs=5 --gpus=-1

I followed the setup steps as described, but ended up with this error:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Complete output

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/cuda/__init__.py:104: UserWarning:
GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the GeForce RTX 3080 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

  | Name          | Type     | Params
-------------------------------------------
0 | model         | MLP      | 936 K
1 | model.dropout | Dropout  | 0
2 | model.fc1     | Linear   | 803 K
3 | model.fc2     | Linear   | 131 K
4 | model.fc3     | Linear   | 1.3 K
5 | train_acc     | Accuracy | 0
6 | val_acc       | Accuracy | 0
7 | test_acc      | Accuracy | 0
-------------------------------------------
936 K     Trainable params
0         Non-trainable params
936 K     Total params
/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:49: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "training/run_experiment.py", line 90, in <module>
    main()
  File "training/run_experiment.py", line 85, in main
    trainer.fit(lit_model, datamodule=data)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
    results = self.accelerator_backend.train()
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 66, in train
    results = self.train_or_test()
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 495, in train
    self.run_sanity_check(self.get_model())
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 693, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 609, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 178, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 84, in validation_step
    return self._step(self.trainer.model.validation_step, args)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 76, in _step
    output = model_step(*args)
  File "/mnt/c/Users/user/GitHub/fsdl-text-recognizer-2021-labs/lab1/text_recognizer/lit_models/base.py", line 58, in validation_step
    logits = self(x)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/c/Users/user/GitHub/fsdl-text-recognizer-2021-labs/lab1/text_recognizer/lit_models/base.py", line 45, in forward
    return self.model(x)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/c/Users/user/GitHub/fsdl-text-recognizer-2021-labs/lab1/text_recognizer/models/mlp.py", line 37, in forward
    x = self.fc1(x)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/myuser/miniconda3/envs/fsdl-text-recognizer-2021/lib/python3.6/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: no kernel image is available for execution on the device
ammarasmro commented 3 years ago

I'm currently working around it as follows:

  1. Change the CUDA version in environment.yml.
  2. Remove the cudnn line from environment.yml.
  3. After setting up the labs, run: conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

With these changes lab1 passes. I'm not sure whether this completely solves the problem, though.
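To check whether a reinstall actually addresses the mismatch reported in the warning, one option is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The snippet below is only a rough sketch, not PyTorch's own check: `parse_sm` and `has_kernel_image` are hypothetical helper names, and the rule applied (same major version, build minor version not above the device's) is the general CUDA binary-compatibility rule. It ignores PTX entries such as `compute_80`, which may be JIT-compiled for newer devices. The live check at the end assumes a PyTorch version that provides `torch.cuda.get_arch_list()`.

```python
import re


def parse_sm(arch):
    """Return (major, minor) for an entry like 'sm_75'.

    Returns None for anything else (e.g. PTX entries like 'compute_80').
    """
    match = re.fullmatch(r"sm_(\d+)", arch)
    if match is None:
        return None
    return divmod(int(match.group(1)), 10)


def has_kernel_image(capability, arch_list):
    """Rough check: is there a compiled kernel image usable on this device?

    Assumption: a binary built for sm_XY runs on a device of capability X.Z
    when Z >= Y. PTX (compute_XX) entries are deliberately ignored here.
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        sm = parse_sm(arch)
        if sm is None:
            continue
        major, minor = sm
        if major == dev_major and minor <= dev_minor:
            return True
    return False


if __name__ == "__main__":
    # Values copied from the UserWarning in this issue (RTX 3080, sm_86).
    issue_archs = "sm_37 sm_50 sm_60 sm_70 sm_75".split()
    print(has_kernel_image((8, 6), issue_archs))  # False

    # Optional live check against the current environment.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(cap, archs, has_kernel_image(cap, archs))
```

If the live check prints `True` after the reinstall, the installed build most likely ships kernels for the card; running a small GPU operation (the traceback here failed in `torch.addmm`) remains the more direct confirmation.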

tranhoangkhuongvn commented 3 years ago

I modified the below to make it work on my RTX3090 + Ubuntu 20:

sunki-hong commented 3 years ago

RTX3070 + Ubuntu 18.04