Closed: Prokrus closed this 2 years ago
What kind of accelerator card are you using: V100 or 3070? If it is not a V100, reduce the values of num_workers and batch_size.
dataset:
    type: Cifar10
    common:
        data_path: /cache/datasets/cifar10/
        train_portion: 0.5
        num_workers: 8 # <- 0
        drop_last: False
    train:
        shuffle: True
        batch_size: 128 # <- 64 or 32
    val:
        batch_size: 3500 # <- 128
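If it helps, the suggested reductions can be applied programmatically instead of by hand. This is only a sketch: it mirrors the dataset section as a plain Python dict (the nesting is assumed from the snippet above), and `shrink_for_small_gpu` is a hypothetical helper, not part of Vega.

```python
import copy

# Plain-dict mirror of the dataset section quoted above (nesting is assumed).
DATASET_CFG = {
    "type": "Cifar10",
    "common": {
        "data_path": "/cache/datasets/cifar10/",
        "train_portion": 0.5,
        "num_workers": 8,
        "drop_last": False,
    },
    "train": {"shuffle": True, "batch_size": 128},
    "val": {"batch_size": 3500},
}


def shrink_for_small_gpu(cfg, num_workers=0, train_batch=32, val_batch=128):
    """Return a copy of cfg with the loader settings reduced; cfg is not modified."""
    out = copy.deepcopy(cfg)
    out["common"]["num_workers"] = num_workers
    out["train"]["batch_size"] = train_batch
    out["val"]["batch_size"] = val_batch
    return out


if __name__ == "__main__":
    small = shrink_for_small_gpu(DATASET_CFG)
    print(small["common"]["num_workers"], small["train"]["batch_size"], small["val"]["batch_size"])
```

The defaults follow the values hinted at in the comments (0 workers, batch size 32 for training, 128 for validation); adjust them to whatever fits the card.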
The card is a 3090. I have tried num_workers=1 and batch_size=4; the error is the same.
Please run the command nvidia-smi and paste all the displayed information.
Please run the following code and paste all the displayed information.
import torch
print(torch.__version__)
print(torch.version.cuda)
I guess the driver, CUDA, and cuDNN versions do not match. I suggest you run a simple model training script first to find the cause of the problem.
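A "simple model training script" for this purpose could look roughly like the following. It is a minimal sketch, not a Vega example: `gpu_smoke_test` is a made-up name, the tiny network and random data are placeholders, and it returns None when PyTorch or a CUDA device is unavailable. If the driver, CUDA, and cuDNN do not fit together, the convolution's forward or backward pass is where an error would be expected to surface.

```python
def gpu_smoke_test(steps=3):
    """Train a tiny conv net on random data for a few steps on the GPU.

    Returns the final loss as a float, or None if PyTorch or CUDA is missing.
    """
    try:
        import torch
        import torch.nn as nn
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    device = torch.device("cuda")
    print("device:", torch.cuda.get_device_name(0))
    print("cudnn:", torch.backends.cudnn.version())

    # A convolution is included on purpose so that cuDNN is exercised.
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # Random CIFAR-sized batch: 4 images, 3 channels, 32x32 pixels.
    x = torch.randn(4, 3, 32, 32, device=device)
    y = torch.randint(0, 10, (4,), device=device)

    loss = None
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()  # force pending GPU errors to be raised here
    return loss.item()


if __name__ == "__main__":
    print("final loss:", gpu_smoke_test())
```

If this fails with the same cuDNN error outside Vega, the problem is most likely in the environment rather than in the Vega configuration.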
I've tried model training with my own code and got the same error. It seems to be a compatibility problem. But if I build the environment myself (CUDA 11.0, PyTorch 1.10.1), no error occurs.
Maybe it is a compatibility problem between Vega and my driver? GPU driver: 460.91.03
You could try running Vega on CUDA 11.0 and PyTorch 1.10.1.
Please execute the following command:
pip install --no-deps ./noah_vega-1.7.1-py3-none-any.whl
Then install the dependencies separately:
pip install click
pip install distributed
pip install numpy
pip install opencv-python-headless
pip install pandas
pip install pareto
pip install pillow
pip install psutil
pip install py-dag
pip install PyYAML
pip install pyzmq
pip install scikit-learn
pip install scipy
pip install tensorboardX
pip install thop
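The separate installs listed above can also be scripted so that one failing package does not stop the rest. The sketch below is only an illustration: `pip_command` and `install_all` are hypothetical helpers, and `dry_run=True` (the default) prints the commands without installing anything.

```python
import subprocess
import sys

# Package names copied from the list above.
DEPENDENCIES = [
    "click", "distributed", "numpy", "opencv-python-headless", "pandas",
    "pareto", "pillow", "psutil", "py-dag", "PyYAML", "pyzmq",
    "scikit-learn", "scipy", "tensorboardX", "thop",
]


def pip_command(package, no_deps=False):
    """Build the argv list for installing one package with the current interpreter's pip."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if no_deps:
        cmd.append("--no-deps")
    cmd.append(package)
    return cmd


def install_all(packages, dry_run=True):
    """Install packages one at a time; return the names whose install failed."""
    failed = []
    for package in packages:
        cmd = pip_command(package)
        if dry_run:
            print(" ".join(cmd))
            continue
        if subprocess.call(cmd) != 0:
            failed.append(package)
    return failed


if __name__ == "__main__":
    install_all(DEPENDENCIES, dry_run=True)
```

Calling `install_all(DEPENDENCIES, dry_run=False)` would perform the installs and return the list of packages that could not be installed.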
Thank you very much! It works, even though some libraries are not compatible with Vega 1.7.1.
When I run the example with vega car.yaml, I get the error: cudnn error: CUDNN_STATUS_EXECUTION_FAILED.
I ran the example in a conda environment: cudatoolkit 10.0, cudnn 7.6.5, torch 1.3.0.
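One likely reading of this setup, offered as an assumption rather than a confirmed diagnosis: the RTX 3090 is an Ampere GPU (compute capability 8.6), and Ampere cards need a CUDA 11 build, so a cudatoolkit 10.0 / torch 1.3.0 environment has no kernels that target it. The sketch below encodes that rule in a small hand-written table. `MIN_CUDA_BY_CAPABILITY` and `build_supports_gpu` are hypothetical names, and the table is illustrative, not an official NVIDIA support matrix.

```python
# Approximate minimum CUDA version per compute capability (illustrative only).
MIN_CUDA_BY_CAPABILITY = {
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 0),  # Ampere, e.g. RTX 3090 (CUDA 11.0 worked in this thread)
}


def parse_version(text):
    """Turn a version string such as '10.0' into a comparable (major, minor) tuple."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def build_supports_gpu(cuda_version, capability):
    """Return True/False if the capability is in the table, or None if it is unknown."""
    minimum = MIN_CUDA_BY_CAPABILITY.get(capability)
    if minimum is None:
        return None
    return parse_version(cuda_version) >= minimum


if __name__ == "__main__":
    print(build_supports_gpu("10.0", (8, 6)))  # False: CUDA 10.0 build on a 3090
    print(build_supports_gpu("11.0", (8, 6)))  # True: the self-built environment
    print(build_supports_gpu("10.0", (7, 0)))  # True: CUDA 10.0 build on a V100
```

On a live system the two inputs can be read from PyTorch itself with `torch.version.cuda` and `torch.cuda.get_device_capability(0)`.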