invalid device ordinal - Githubissues

gingerhead22 commented 7 years ago

I got an error when I tried to run the train_lsp.sh file. It reported a CUDA error: invalid device ordinal.

I use Cygwin on Windows 10. The GPU is a GTX 1070, and both CUDA and cuDNN work well.

Do I need two GPUs to run AlexNet?

Thank you very much.

@mitmul

ThorJonsson commented 7 years ago

I'm having the same problem, have you found a solution?

ThorJonsson commented 7 years ago

You can try to reset the 'gpus' number in the sh file pertaining to the dataset you're training on. This file is in the 'bash' folder.

gingerhead22 commented 7 years ago

Thank you for your suggestion. Yep, I set my gpus to 0 and the problem was solved. However, I then ran into more problems with the pickle() command.

StandWisdom commented 7 years ago

@gingerhead22 Hi, I want to ask: if I have just one GPU (a GTX 1070), should the 'gpus' number be '0' or '1'? I don't know what my GPU number is, and I couldn't find an explanation of the 'gpus' number.

gingerhead22 commented 7 years ago

@ShaoNeilz Hey, you should set it to 0, since device ordinals start from 0.
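
To illustrate the zero-based numbering, here is a minimal sketch of a check for a 'gpus' value. It is not code from the deeppose repository: the helper name `parse_gpu_ids` is hypothetical, and it assumes the 'gpus' option is a comma-separated string of ordinals. The number of installed GPUs has to be supplied by the caller (for example by counting the lines printed by `nvidia-smi -L`).

```python
def parse_gpu_ids(gpus, device_count):
    """Parse a comma-separated string of GPU ordinals and validate it.

    Hypothetical helper, not part of deeppose. CUDA numbers devices
    from 0, so with `device_count` GPUs installed the valid ordinals
    are 0 .. device_count - 1.
    """
    # Assumption: several GPUs would be given as e.g. "0,1".
    ids = [int(tok) for tok in gpus.split(",") if tok.strip()]
    bad = [i for i in ids if not 0 <= i < device_count]
    if bad:
        raise ValueError(
            "invalid device ordinal(s) %s: valid range is 0..%d"
            % (bad, device_count - 1)
        )
    return ids


if __name__ == "__main__":
    # One GPU installed: ordinal 0 is accepted.
    print(parse_gpu_ids("0", device_count=1))
    # Ordinal 1 would refer to a second GPU, so it is rejected.
    try:
        parse_gpu_ids("1", device_count=1)
    except ValueError as err:
        print(err)
```

With a single GPU, only `0` passes the check, which matches the advice above.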

StandWisdom commented 7 years ago

@gingerhead22 Thanks for your answer! May I ask another question?
When I run 'train_flic.sh', I don't get any output about the training, and the run halts. Here is my terminal output. This makes me really confused. OS: Ubuntu 14.04 amd64; Python 2.7.x.

shaolizhi@standwisdom:/machinelearning/deeppose-master$ bash shells/train_flic.sh
2017-03-07 12:49:50,150 [INFO] sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
2017-03-07 12:49:50,151 [INFO] chainer version: 1.21.0
2017-03-07 12:49:50,151 [INFO] cuda: True, cudnn: False
2017-03-07 12:49:50,151 [INFO] Namespace(adam_alpha=0.001, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, base_zoom=1.5, batchsize=128, channel=3, coord_normalize=True, epoch=101, fliplr=True, fname_index=0, gcn=True, gpus='1', ignore_label=-1, im_size=220, img_dir='data/FLIC-full/images', joint_index=1, lr=0.01, lr_decay_freq=10, lr_decay_ratio=0.1, min_dim=0, model='models/AlexNet.py', n_joints=7, opt='Adam', resume_model=None, resume_opt=None, resume_param=None, rotate=True, rotate_range=10, seed=1701, show_log_iter=10, snapshot=10, symmetric_joints='[[2, 4], [1, 5], [0, 6]]', test_csv_fn='data/FLIC-full/test_joints.csv', test_freq=10, train_csv_fn='data/FLIC-full/train_joints.csv', translate=True, translate_range=5, valid_freq=5, weight_decay=0.0005, zoom=True, zoom_range=0.2)
shells/train_flic.sh: line 31: 23280 Killed CHAINER_TYPE_CHECK=0 python scripts/train.py --model models/AlexNet.py --gpus 1 --epoch 100 --batchsize 128 --snapshot 10 --valid_freq 5 --train_csv_fn data/FLIC-full/train_joints.csv --test_csv_fn data/FLIC-full/test_joints.csv --img_dir data/FLIC-full/images --test_freq 10 --seed 1701 --im_size 220 --fliplr --rotate --rotate_range 10 --zoom --zoom_range 0.2 --translate --translate_range 5 --coord_normalize --gcn --n_joints 7 --fname_index 0 --joint_index 1 --symmetric_joints "[[2, 4], [1, 5], [0, 6]]" --opt Adam

yzy-thu commented 6 years ago

@ShaoNeilz Did you solve it? I have the same problem.

mitmul / deeppose

invalid device ordinal #31