victoresque / pytorch-template

PyTorch deep learning projects made easy.
MIT License

Multi Gpu Usage Problem #47

Closed EzoBear closed 5 years ago

EzoBear commented 5 years ago


When I run train.py with the command `CUDA_VISIBLE_DEVICES=2,3 python train.py -c config.json` (with the `n_gpu` option set to 2), training still starts on GPU 0. The cause seems to be that `list_ids` is built simply as `range(n_gpu_use)`, so if `n_gpu` is 2, `list_ids` is `[0, 1]`.
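The logic being described can be sketched like this (a simplified, hypothetical stand-in for the template's device-setup helper; the real one queries `torch.cuda.device_count()` and returns a `torch.device` as well):

```python
def prepare_device(n_gpu_use, n_gpu_available):
    """Simplified sketch of the template's device setup: the id list is
    always built as range(n_gpu_use), so it starts at 0 regardless of
    which physical GPUs CUDA_VISIBLE_DEVICES exposes."""
    n_gpu_use = min(n_gpu_use, n_gpu_available)
    list_ids = list(range(n_gpu_use))
    return list_ids

print(prepare_device(2, 4))  # [0, 1] -- never [2, 3]
```

This is not a bug by itself: after `CUDA_VISIBLE_DEVICES=2,3` masks the devices, logical ids 0 and 1 refer to physical GPUs 2 and 3.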

SunQpark commented 5 years ago

If you have 3 GPUs, they will be indexed as 0, 1, 2, not 1, 2, 3. So you should use `CUDA_VISIBLE_DEVICES=1,2` to use the last two. Adding `-d 1,2` to the command does the same thing.

But I don't understand why it uses GPU 0 (rather than 2) as well, and I cannot reproduce this problem on my machines. Can you try `python train.py -c config.json -d 1,2` and tell me how it goes?
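The re-indexing described above can be made concrete with a small helper (hypothetical, for illustration only): `CUDA_VISIBLE_DEVICES` masks the GPUs and the survivors are renumbered from 0 inside the process.

```python
def visible_to_physical(visible_devices, logical_id):
    """Map a logical CUDA index back to the physical GPU id.
    visible_devices is the CUDA_VISIBLE_DEVICES string, e.g. "1,2"."""
    physical = [int(d) for d in visible_devices.split(",")]
    return physical[logical_id]

print(visible_to_physical("1,2", 0))  # 1: cuda:0 is physical GPU 1
print(visible_to_physical("1,2", 1))  # 2: cuda:1 is physical GPU 2
```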

EzoBear commented 5 years ago

Someone else is using GPU 0, and I wanted to use GPUs 1 and 2. When I set the GPUs with the `-d` option as you suggested, I do not get an error.

But I wonder why it works when the device index is assigned from `args.device` to `os.environ["CUDA_VISIBLE_DEVICES"]` inside train.py, while setting `CUDA_VISIBLE_DEVICES=2,3` on the command line does not. I thought the two were equivalent.
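One likely explanation (an assumption on my part, not confirmed in this thread): the variable only takes effect if it is set before the CUDA runtime initializes, which is what the in-script assignment guarantees.

```python
import os

# Setting the variable inside train.py works because it happens before
# the first CUDA call; once torch.cuda has initialized, changing
# CUDA_VISIBLE_DEVICES no longer affects which devices are visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
# ...only after this point should torch / CUDA be initialized.
```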

thank you.

EzoBear commented 5 years ago

However, this solution causes an error: a tensor ends up on "cuda:0", so the data parallel module fails as follows.

File "/home/yjk931004/anaconda2/envs/mypython3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))

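The failing check can be reproduced in isolation (a minimal sketch using a dummy class in place of `torch.Tensor`): DataParallel asserts that every input it is about to scatter is already a CUDA tensor, and a tensor left on the CPU (or on a device that was masked away) trips it.

```python
class FakeTensor:
    """Stand-in for torch.Tensor, exposing only the .is_cuda flag."""
    def __init__(self, is_cuda):
        self.is_cuda = is_cuda

def scatter_check(inputs):
    # Same shape as the line in torch/nn/parallel/_functions.py:
    #     assert all(map(lambda i: i.is_cuda, inputs))
    return all(map(lambda i: i.is_cuda, inputs))

print(scatter_check([FakeTensor(True), FakeTensor(True)]))   # True: all on GPU
print(scatter_check([FakeTensor(True), FakeTensor(False)]))  # False: the failing case
```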