cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

huawei-noah / vega

AutoML tools chain

http://www.noahlab.com.hk/opensource/vega/

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #194

Closed (Prokrus closed this issue 2 years ago)

Prokrus commented 2 years ago

When I run the example with vega car.yaml, I get the error: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED.

I am running the example in a conda environment. cudatoolkit: 10.0, cudnn: 7.6.5, torch: 1.3.0

zhangjiajin commented 2 years ago

What kind of accelerator card are you using? A V100, or a 3070? If it is not a V100, reduce the values of num_workers and batch_size, as marked in the comments below.

    dataset:
        type: Cifar10
        common:
            data_path: /cache/datasets/cifar10/
            train_portion: 0.5
            num_workers: 8    # <- 0
            drop_last: False
        train:
            shuffle: True
            batch_size: 128   # <- 64 or 32
        val:
            batch_size: 3500  # <- 128

Prokrus commented 2 years ago

The card is a 3090. I have tried num_workers=1 and batch_size=4, and the error is the same.

zhangjiajin commented 2 years ago

Please run the nvidia-smi command and paste the full output.

zhangjiajin commented 2 years ago

Please run the following code and paste the full output.

    import torch
    print(torch.__version__)
    print(torch.version.cuda)
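If the two version strings are not enough to narrow things down, a slightly fuller check might look like the sketch below. The helper name collect_env is hypothetical and not part of Vega; it uses only standard PyTorch attributes, and guards torch.cuda.get_arch_list() because older releases do not provide it.

```python
import platform


def collect_env():
    """Gather version info relevant to a cuDNN failure; does not raise if torch is absent."""
    info = {"python": platform.python_version(), "torch": None}
    try:
        import torch
    except ImportError:
        return info

    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda          # CUDA version torch was built with
    info["cudnn"] = torch.backends.cudnn.version()   # None if cuDNN is unavailable
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
        # get_arch_list() is missing from older releases, hence the guard
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        info["built_for"] = get_arch_list() if get_arch_list else "unknown"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

If the device capability is newer than anything PyTorch was built for, a version mismatch becomes a likely suspect.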

zhangjiajin commented 2 years ago

My guess is that the driver, CUDA, and cuDNN versions do not match. I suggest running a simple model training script outside Vega first, to find out whether the problem is in the environment or in Vega.
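A minimal sketch of such a script follows. The function name cudnn_smoke_test, the tiny network, and the random data are all placeholders chosen for illustration; the only goal is to push one convolution through cuDNN in both the forward and backward pass.

```python
def cudnn_smoke_test(steps=3):
    """Run a few conv training steps on the GPU and return a short status string."""
    try:
        import torch
        import torch.nn as nn
    except ImportError:
        return "torch-missing"
    if not torch.cuda.is_available():
        return "no-cuda"

    device = torch.device("cuda:0")
    try:
        model = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),  # convolution goes through cuDNN
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, 10),
        ).to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        criterion = nn.CrossEntropyLoss()

        for _ in range(steps):
            # Random CIFAR-shaped batch; the values are irrelevant for this check
            x = torch.randn(4, 3, 32, 32, device=device)
            y = torch.randint(0, 10, (4,), device=device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    except RuntimeError as exc:
        return f"failed: {exc}"
    return "ok"


if __name__ == "__main__":
    print(cudnn_smoke_test())
```

If this already fails with the same cuDNN error, the cause is in the environment rather than in Vega.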

Prokrus commented 2 years ago

I've tried model training with my own code, and I get the same error in that environment, so it seems to be a compatibility problem. But if I build the environment myself (cuda 11.0, pytorch 1.10.1), no error occurs.

Maybe it is a compatibility problem between Vega and my driver? GPU driver: 460.91.03
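One plausible explanation, not confirmed in this thread, is that the RTX 3090 (compute capability 8.6) is newer than cudatoolkit 10.0 can target. The sketch below is a rough sanity check for that idea. The table of minimum toolkit versions is written from memory of NVIDIA's release notes and should be treated as an assumption, the function name toolkit_supports is hypothetical, and the check ignores PTX JIT compilation.

```python
# Minimum CUDA toolkit release that can compile natively for each architecture.
# Assumed values; verify against NVIDIA's release notes before relying on them.
MIN_TOOLKIT = {
    (7, 0): (9, 0),    # Volta, e.g. V100
    (7, 5): (10, 0),   # Turing
    (8, 0): (11, 0),   # Ampere, e.g. A100
    (8, 6): (11, 1),   # Ampere, e.g. RTX 3070 / 3090
}


def toolkit_supports(toolkit, capability):
    """True if `toolkit` can target `capability` natively, or through an older
    architecture of the same major version (binary compatibility)."""
    major, minor = capability
    return any(
        arch[0] == major and arch[1] <= minor and toolkit >= needed
        for arch, needed in MIN_TOOLKIT.items()
    )


if __name__ == "__main__":
    print(toolkit_supports((10, 0), (8, 6)))  # False: CUDA 10.0 predates Ampere
    print(toolkit_supports((11, 0), (8, 6)))  # True: runs via the 8.0 target
    print(toolkit_supports((10, 0), (7, 0)))  # True: V100 is covered
```

Under these assumptions the failing environment (cudatoolkit 10.0) cannot target the 3090, while the working one (cuda 11.0) can.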

zhangjiajin commented 2 years ago

You could try running Vega on CUDA 11.0 and PyTorch 1.10.1.

First install Vega itself without its pinned dependencies:

    pip install --no-deps ./noah_vega-1.7.1-py3-none-any.whl

Then install the dependencies separately.

    pip install click
    pip install distributed
    pip install numpy
    pip install opencv-python-headless
    pip install pandas
    pip install pareto
    pip install pillow
    pip install psutil
    pip install py-dag
    pip install PyYAML
    pip install pyzmq
    pip install scikit-learn
    pip install scipy
    pip install tensorboardX
    pip install thop
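Because --no-deps skips pip's dependency resolution, it can help to confirm afterwards that every package is actually importable. The sketch below does that with the standard library only. The helper name missing_modules is hypothetical, and the import names for packages whose module name differs from the pip name (for example cv2, PIL, yaml, zmq, sklearn, dag) are assumptions.

```python
import importlib.util

# pip distribution name -> import name (import names are assumptions where they differ)
DEPENDENCIES = {
    "click": "click",
    "distributed": "distributed",
    "numpy": "numpy",
    "opencv-python-headless": "cv2",
    "pandas": "pandas",
    "pareto": "pareto",
    "pillow": "PIL",
    "psutil": "psutil",
    "py-dag": "dag",
    "PyYAML": "yaml",
    "pyzmq": "zmq",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "tensorboardX": "tensorboardX",
    "thop": "thop",
}


def missing_modules(deps=DEPENDENCIES):
    """Return the distribution names whose module cannot be found."""
    return [
        dist for dist, module in deps.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located; it does not check that their versions are ones Vega 1.7.1 expects.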

Prokrus commented 2 years ago

Thank you very much! It works, even though some of the installed libraries are not compatible with Vega 1.7.1.