soumith / imagenet-multiGPU.torch

an imagenet example in torch.
BSD 2-Clause "Simplified" License

nGPU=4 is slower than nGPU=1 #49

Closed · chienlinhuang1116 closed this issue 8 years ago

chienlinhuang1116 commented 8 years ago

Hi,

Do you know why the results show that “nGPU=4” is still slower than “nGPU=1”?

-nGPU 1 -batchSize 128
Epoch: [1][2/10000] Time 0.835 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][3/10000] Time 0.834 Err 6.9073 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.233
Epoch: [1][4/10000] Time 0.834 Err 6.9104 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][5/10000] Time 0.834 Err 6.9075 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.871
Epoch: [1][6/10000] Time 0.836 Err 6.9064 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][7/10000] Time 0.833 Err 6.9077 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.776

-nGPU 4 -batchSize 512
Epoch: [1][2/10000] Time 3.915 Err 6.9070 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][3/10000] Time 4.449 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 11.843
Epoch: [1][4/10000] Time 3.906 Err 6.9079 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][5/10000] Time 3.898 Err 6.9078 Top1-%: 0.20 LR 1e-02 DataLoadingTime 7.108
Epoch: [1][6/10000] Time 3.902 Err 6.9079 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.005

-nGPU 8 -batchSize 1024
Epoch: [1][2/10000] Time 7.186 Err 6.9080 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.006
Epoch: [1][3/10000] Time 6.947 Err 6.9079 Top1-%: 0.10 LR 1e-02 DataLoadingTime 24.149
Epoch: [1][4/10000] Time 6.724 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.007
Epoch: [1][5/10000] Time 7.773 Err 6.9080 Top1-%: 0.10 LR 1e-02 DataLoadingTime 16.892
Epoch: [1][6/10000] Time 6.731 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.007

Thank you.

Chien-Lin

buttomnutstoast commented 8 years ago

We have run into a similar problem: multi-GPU training is much slower than single GPU.

buttomnutstoast commented 8 years ago

OK, we found that the package contributors have already fixed the problem mentioned in this post.

soumith commented 8 years ago

hi, sorry about that, yes it should be fixed in the latest commit.

chienlinhuang1116 commented 8 years ago

Hi, I tested it and the results still show that “nGPU=4” is slower than “nGPU=1”. Do you have any comment? Thank you.

th main.lua -data ~/imagenet -nGPU 1 -batchSize 128
Epoch: [1][2/10000] Time 0.844 Err 6.9071 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][3/10000] Time 0.845 Err 6.9084 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.041
Epoch: [1][4/10000] Time 0.845 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.051
Epoch: [1][5/10000] Time 0.845 Err 6.9092 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.557
Epoch: [1][6/10000] Time 0.843 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003

th main.lua -data ~/imagenet -nGPU 4 -batchSize 128
Epoch: [1][2/10000] Time 1.781 Err 6.9064 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][3/10000] Time 1.765 Err 6.9066 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.181
Epoch: [1][4/10000] Time 1.761 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004
Epoch: [1][5/10000] Time 1.760 Err 6.9089 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.699
Epoch: [1][6/10000] Time 1.763 Err 6.9058 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004

th main.lua -data ~/imagenet -nGPU 4 -batchSize 256
Epoch: [1][2/10000] Time 2.479 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004
Epoch: [1][3/10000] Time 2.421 Err 6.9074 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.012
Epoch: [1][4/10000] Time 2.369 Err 6.9066 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.107
Epoch: [1][5/10000] Time 2.368 Err 6.9078 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.725
Epoch: [1][6/10000] Time 2.368 Err 6.9079 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.005

soumith commented 8 years ago

@chienlinhuang1116 what commit hash are you on?

chienlinhuang1116 commented 8 years ago

Thank you soumith, I believe I am on the latest commit hash.

(1) You are right, the problem has already been fixed. When I work on AWS Ubuntu instances with the fbcunn libs installed, it behaves correctly and multi-GPU is much faster than single GPU.

(2) Because the latest imagenet example does not seem to work with the fbcunn libs, I tested it on CentOS GPU machines with only the Torch7 and nn packages installed. In this case, multi-GPU is still slower than single GPU.

soumith commented 8 years ago

in the case of (2), did you install torch freshly? Because the latest nn / cunn plus the latest commit hash of this repo no longer has the slowness.

chienlinhuang1116 commented 8 years ago

Hi, I installed torch freshly using the following steps, but multi-GPU is still slower than single GPU on the CentOS GPU machines. Do you have any idea?

Thank you.

curl -s https://raw.githubusercontent.com/torch/ezinstall/master/clean-old.sh | bash
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; ./install.sh
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-luajit+torch | PREFIX=~/torch bash

arashno commented 8 years ago

Hi, I have the exact same problem: nGPU=2 is slower than just one GPU. I installed the latest versions of Torch, cunn, and this repository; everything is up to date, but it is still slower. Any ideas? @soumith @chienlinhuang1116 @buttomnutstoast Here are my outputs:

for 1 GPU:

==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/9000] Time 0.826 Err 3.8455 Top1-%: 3.12 Topn-%: 7.81 LR 1e-02 DataLoadingTime 5.466
Epoch: [1][2/9000] Time 0.709 Err 3.6204 Top1-%: 17.97 Topn-%: 36.72 LR 1e-02 DataLoadingTime 0.072
Epoch: [1][3/9000] Time 0.701 Err 3.1000 Top1-%: 30.47 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.009
Epoch: [1][4/9000] Time 0.719 Err 2.7807 Top1-%: 27.34 Topn-%: 61.72 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][5/9000] Time 0.687 Err 2.8056 Top1-%: 25.78 Topn-%: 63.28 LR 1e-02 DataLoadingTime 1.749
Epoch: [1][6/9000] Time 0.715 Err 3.1090 Top1-%: 19.53 Topn-%: 55.47 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][7/9000] Time 0.719 Err 2.7177 Top1-%: 25.78 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.011
Epoch: [1][8/9000] Time 0.676 Err 2.9563 Top1-%: 17.97 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.010

for two GPUs:

==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/9000] Time 4.474 Err 3.8503 Top1-%: 1.56 Topn-%: 9.38 LR 1e-02 DataLoadingTime 6.425
Epoch: [1][2/9000] Time 2.692 Err 3.5693 Top1-%: 23.44 Topn-%: 42.97 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][3/9000] Time 2.539 Err 3.2223 Top1-%: 28.12 Topn-%: 57.81 LR 1e-02 DataLoadingTime 0.022
Epoch: [1][4/9000] Time 2.511 Err 3.0643 Top1-%: 25.78 Topn-%: 57.81 LR 1e-02 DataLoadingTime 0.019
Epoch: [1][5/9000] Time 2.500 Err 2.8987 Top1-%: 31.25 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.024
Epoch: [1][6/9000] Time 2.497 Err 3.2392 Top1-%: 23.44 Topn-%: 55.47 LR 1e-02 DataLoadingTime 0.020
Epoch: [1][7/9000] Time 2.494 Err 2.8436 Top1-%: 21.88 Topn-%: 63.28 LR 1e-02 DataLoadingTime 0.023
Epoch: [1][8/9000] Time 2.499 Err 2.7006 Top1-%: 22.66 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.015
Epoch: [1][9/9000] Time 2.493 Err 2.9153 Top1-%: 17.19 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.021
Epoch: [1][10/9000] Time 2.503 Err 2.7242 Top1-%: 21.09 Topn-%: 62.50 LR 1e-02 DataLoadingTime 0.019

buttomnutstoast commented 8 years ago

In my experience, you can try:

1) use multiple threads in DataParallelTable (see the sketch after this list)
2) call model:getParameters() before model:forward()
3) check that nccl is installed correctly
4) maybe hardware issues...
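
This is not the exact code from this repo, just a minimal sketch of 1) and 2) combined, assuming a 4-GPU machine; the single convolution layer and the GPU list {1, 2, 3, 4} are placeholders:

 -- Illustrative sketch: multi-threaded DataParallelTable plus an explicit
 -- getParameters() call before the first forward pass.
 require 'cunn'
 require 'cudnn'

 local net = cudnn.SpatialConvolution(3, 64, 7, 7):cuda() -- stand-in for a full model

 local model = nn.DataParallelTable(1, true, true) -- flatten parameters, use NCCL if present
 model:add(net, {1, 2, 3, 4})                      -- replicate across GPUs 1-4
 model:threads(function()
    require 'cudnn'                                -- each worker thread loads cudnn itself
 end)

 -- Flatten weights/gradients into contiguous buffers before the first
 -- forward(); doing this late (or not at all) can slow multi-GPU training.
 local params, gradParams = model:getParameters()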

arashno commented 8 years ago

@buttomnutstoast Thanks for your answer. I didn't have NCCL, so I installed it, and the installation seems fine. Then I replaced util.lua:10 with this:

 model = nn.DataParallelTable(1):threads(function()
    require 'cudnn'
 end)

Nothing changed in this case; it is still very slow.

Then I tried to enable NCCL with this:

 model = nn.DataParallelTable(1, true, true):threads(function()
    require 'cudnn'
 end)

But this time the program just stops and does nothing! Any idea?

buttomnutstoast commented 8 years ago

Did you install the NCCL C++ library from source? The module installed from luarocks is simply an interface to it.
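
A quick sanity check (assuming the nccl.torch bindings from luarocks): requiring the module fails when the underlying C++ library cannot be found, e.g.:

 -- DataParallelTable only enables NCCL when require 'nccl' succeeds, which
 -- needs libnccl.so built from the C++ sources and visible on the loader path.
 local ok, err = pcall(require, 'nccl')
 print(ok and 'nccl bindings loaded' or ('nccl not usable: ' .. tostring(err)))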

arashno commented 8 years ago

Yes, I installed the C++ library from source. I also tried enabling NCCL in two steps, but there was no difference.

 model = nn.DataParallelTable(1, true, true)
 model:threads(function()
    require 'cudnn'
 end)

It just does nothing; it seems to be trapped in a deadlock.

arashno commented 8 years ago

I had an old installation of Torch, and it seems that it was causing the problem. After I removed it, everything works fine now. Thanks