Closed EzoBear closed 5 years ago
If you have 3 GPUs, then they will be indexed as 0, 1, 2, not 1, 2, 3. So, you should use CUDA_VISIBLE_DEVICES=1,2 to use the last two. Adding -d 1,2 to the command does the same thing.
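To illustrate the renumbering: inside the process, device indices always start at 0 and refer to the i-th GPU listed in the mask. A small sketch of that mapping (visible_device_map is a made-up helper for illustration, not part of the template):

```python
def visible_device_map(env):
    """Map in-process device indices to physical GPU ids based on
    CUDA_VISIBLE_DEVICES: in-process index i refers to the i-th
    physical GPU listed in the variable."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # no masking: identity mapping over all GPUs
    return [int(x) for x in raw.split(",") if x.strip()]

# With CUDA_VISIBLE_DEVICES=1,2 the process sees two devices:
# in-process cuda:0 -> physical GPU 1, cuda:1 -> physical GPU 2.
print(visible_device_map({"CUDA_VISIBLE_DEVICES": "1,2"}))  # [1, 2]
```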
But I don't understand why it also uses GPU 0 (not 2), and this problem does not reproduce on my machines.
Can you try python train.py -c config.json -d 1,2 and tell me how it goes?
Someone is using GPU 0, and I wanted to use 1 and 2. When I set the GPU option with the -d option you suggested, I do not get an error.
But I wonder why it works when the device index is assigned from args.device to os.environ["CUDA_VISIBLE_DEVICES"] in train.py, while CUDA_VISIBLE_DEVICES=2,3 on the command line does not work. I think they are the same operation.
thank you.
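One thing worth noting here: the CUDA runtime reads CUDA_VISIBLE_DEVICES only once, when CUDA is first initialized, so where the variable is set relative to that first use matters. A minimal pure-Python sketch of that snapshot behavior (FakeCudaRuntime is a stand-in for illustration, not a real API):

```python
import os

class FakeCudaRuntime:
    """Stand-in that mimics how the real CUDA runtime snapshots
    CUDA_VISIBLE_DEVICES once, on first initialization, and then
    ignores later changes to the environment variable."""
    def __init__(self):
        self._visible = None

    def device_ids(self):
        if self._visible is None:  # lazy init, like the first CUDA call
            raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
            self._visible = [int(x) for x in raw.split(",") if x.strip()]
        return self._visible

runtime = FakeCudaRuntime()
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"  # set before first use: respected
print(runtime.device_ids())                 # [2, 3]
os.environ["CUDA_VISIBLE_DEVICES"] = "0"    # changed after init: ignored
print(runtime.device_ids())                 # still [2, 3]
```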
But this solution causes an error, because the tensors are placed on "cuda:0", so the DataParallel module raises the following error:
File "/home/yjk931004/anaconda2/envs/mypython3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))
When I execute train.py with the command CUDA_VISIBLE_DEVICES=2,3 python train.py -c config.json (with the n_gpu option set to 2), GPU index 0 is still the one used and started, because list_ids is built simply as range(n_gpu_use); so if n_gpu is 2, list_ids is [0, 1].
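For reference, a minimal sketch of how that range(n_gpu_use) logic interacts with the mask (prepare_device_ids is a hypothetical stand-in for the template's helper, not its exact code):

```python
def prepare_device_ids(n_gpu_use):
    # The ids here are relative to the devices the process can see.
    # With CUDA_VISIBLE_DEVICES=2,3 the visible devices are re-indexed
    # as 0 and 1, so [0, 1] actually targets physical GPUs 2 and 3;
    # seeing index 0 does not necessarily mean physical GPU 0 is used.
    return list(range(n_gpu_use))

print(prepare_device_ids(2))  # [0, 1]
```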