We ran into a similar problem: training with multiple GPUs is much slower than with a single GPU.
OK, we found that the package contributors have already fixed the problem mentioned in this post.
hi, sorry about that, yes it should be fixed in the latest commit.
Hi, I tested it and the results still show that “nGPU=4” is slower than “nGPU=1”. Do you have any comment? Thank you.
th main.lua -data ~/imagenet -nGPU 1 -batchSize 128
Epoch: [1][2/10000] Time 0.844 Err 6.9071 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][3/10000] Time 0.845 Err 6.9084 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.041
Epoch: [1][4/10000] Time 0.845 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.051
Epoch: [1][5/10000] Time 0.845 Err 6.9092 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.557
Epoch: [1][6/10000] Time 0.843 Err 6.9095 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003

th main.lua -data ~/imagenet -nGPU 4 -batchSize 128
Epoch: [1][2/10000] Time 1.781 Err 6.9064 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][3/10000] Time 1.765 Err 6.9066 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.181
Epoch: [1][4/10000] Time 1.761 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004
Epoch: [1][5/10000] Time 1.760 Err 6.9089 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.699
Epoch: [1][6/10000] Time 1.763 Err 6.9058 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004

th main.lua -data ~/imagenet -nGPU 4 -batchSize 256
Epoch: [1][2/10000] Time 2.479 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.004
Epoch: [1][3/10000] Time 2.421 Err 6.9074 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.012
Epoch: [1][4/10000] Time 2.369 Err 6.9066 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.107
Epoch: [1][5/10000] Time 2.368 Err 6.9078 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.725
Epoch: [1][6/10000] Time 2.368 Err 6.9079 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.005
@chienlinhuang1116 what commit hash are you on?
Thank you, soumith. I think it should be the latest commit hash.
(1) You are right, the problem had already been fixed. When I run on AWS Ubuntu instances with the fbcunn libraries installed, it works correctly and multi-GPU is much faster than a single GPU.
(2) Because the latest imagenet example does not seem to work with the fbcunn libraries, I tested it on CentOS GPU machines with only Torch7 and the nn packages installed. In this case, multi-GPU is still slower than a single GPU.
In the case of (2), did you install Torch freshly? Because the latest nn / cunn plus the latest commit hash of this repo no longer has the slowness.
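One quick way to rule out a stale install shadowing the fresh one is to check, inside th, which installation actually gets loaded. A minimal sketch using only stock Lua/LuaJIT calls:

print(jit and jit.version or _VERSION)           -- which LuaJIT is running
print(package.searchpath('nn', package.path))    -- file the 'nn' package resolves to
print(package.searchpath('cunn', package.path))  -- same for 'cunn'

If these point into an old install prefix rather than ~/torch, the fresh install is being shadowed.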
Hi, I installed Torch freshly using the following steps, but multi-GPU is still slower than a single GPU on the CentOS GPU machines. Do you have any idea?
Thank you.
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/clean-old.sh | bash
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; ./install.sh
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-luajit+torch | PREFIX=~/torch bash
Hi, I have the exact same problem. When I use nGPU=2 it is slower than just one GPU. I installed the latest versions of Torch, cunn, and this repository. Everything is up to date, but it is still slower. Any ideas? @soumith @chienlinhuang1116 @buttomnutstoast Here are my outputs:
For 1 GPU:
==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/9000] Time 0.826 Err 3.8455 Top1-%: 3.12 Topn-%: 7.81 LR 1e-02 DataLoadingTime 5.466
Epoch: [1][2/9000] Time 0.709 Err 3.6204 Top1-%: 17.97 Topn-%: 36.72 LR 1e-02 DataLoadingTime 0.072
Epoch: [1][3/9000] Time 0.701 Err 3.1000 Top1-%: 30.47 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.009
Epoch: [1][4/9000] Time 0.719 Err 2.7807 Top1-%: 27.34 Topn-%: 61.72 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][5/9000] Time 0.687 Err 2.8056 Top1-%: 25.78 Topn-%: 63.28 LR 1e-02 DataLoadingTime 1.749
Epoch: [1][6/9000] Time 0.715 Err 3.1090 Top1-%: 19.53 Topn-%: 55.47 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][7/9000] Time 0.719 Err 2.7177 Top1-%: 25.78 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.011
Epoch: [1][8/9000] Time 0.676 Err 2.9563 Top1-%: 17.97 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.010
For 2 GPUs:
==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/9000] Time 4.474 Err 3.8503 Top1-%: 1.56 Topn-%: 9.38 LR 1e-02 DataLoadingTime 6.425
Epoch: [1][2/9000] Time 2.692 Err 3.5693 Top1-%: 23.44 Topn-%: 42.97 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][3/9000] Time 2.539 Err 3.2223 Top1-%: 28.12 Topn-%: 57.81 LR 1e-02 DataLoadingTime 0.022
Epoch: [1][4/9000] Time 2.511 Err 3.0643 Top1-%: 25.78 Topn-%: 57.81 LR 1e-02 DataLoadingTime 0.019
Epoch: [1][5/9000] Time 2.500 Err 2.8987 Top1-%: 31.25 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.024
Epoch: [1][6/9000] Time 2.497 Err 3.2392 Top1-%: 23.44 Topn-%: 55.47 LR 1e-02 DataLoadingTime 0.020
Epoch: [1][7/9000] Time 2.494 Err 2.8436 Top1-%: 21.88 Topn-%: 63.28 LR 1e-02 DataLoadingTime 0.023
Epoch: [1][8/9000] Time 2.499 Err 2.7006 Top1-%: 22.66 Topn-%: 65.62 LR 1e-02 DataLoadingTime 0.015
Epoch: [1][9/9000] Time 2.493 Err 2.9153 Top1-%: 17.19 Topn-%: 60.16 LR 1e-02 DataLoadingTime 0.021
Epoch: [1][10/9000] Time 2.503 Err 2.7242 Top1-%: 21.09 Topn-%: 62.50 LR 1e-02 DataLoadingTime 0.019
In my experience, you can try (see the sketch below):
1) use multiple threads in DataParallelTable
2) call model:getParameters() before model:forward()
3) check that NCCL is installed correctly
4) maybe hardware issues...
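For 1) and 2), a minimal sketch of what that can look like, assuming model is an nn module already moved to the GPU and nGPU GPUs are visible:

require 'cunn'

local gpus = torch.range(1, nGPU):totable()
local dpt = nn.DataParallelTable(1)  -- split each batch along dimension 1
   :add(model, gpus)                 -- replicate the module onto each listed GPU
   :threads(function()
      require 'cudnn'                -- load cudnn inside every worker thread
   end)

local params, gradParams = dpt:getParameters()  -- flatten parameters before the first forward()
-- then train as usual with dpt:forward(input) / dpt:backward(input, gradOutput)

The usual rule in Torch is to call getParameters() exactly once, before training starts, since it reallocates the parameter storages.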
@buttomnutstoast Thanks for your answer. I didn't have NCCL, so I installed it. The installation seems fine. Then I replaced util.lua:10 with this:
model = nn.DataParallelTable(1):threads(function()
   require 'cudnn'
end)
Nothing changed in this case; it is still very slow.
Then I tried to enable NCCL with this:
model = nn.DataParallelTable(1, true, true):threads(function()
   require 'cudnn'
end)
But this time the program just stops and does nothing! Any idea?
Did you install the NCCL C++ library from source? The module installed from luarocks is simply an interface to it.
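A quick way to tell whether the bindings can actually find the C++ library is to try loading them from th. A rough sketch, assuming the luarocks package is required as 'nccl':

-- fails with a loader error if libnccl.so cannot be found
local ok, res = pcall(require, 'nccl')
print(ok and 'nccl bindings loaded' or ('nccl not usable: ' .. tostring(res)))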
Yes, I installed it from the C++ source. I also tried enabling NCCL in two steps, but there was no difference.
model = nn.DataParallelTable(1, true, true)
model:threads(function()
   require 'cudnn'
end)
It just does nothing; it seems to be stuck in a deadlock.
I had an old installation of Torch, and it seems that it was causing the problem. After I removed it, everything works fine now. Thanks!
Hi,
Do you know why the results still show that “nGPU=4” is slower than “nGPU=1”?
-nGPU 1 -batchSize 128
Epoch: [1][2/10000] Time 0.835 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][3/10000] Time 0.834 Err 6.9073 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.233
Epoch: [1][4/10000] Time 0.834 Err 6.9104 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][5/10000] Time 0.834 Err 6.9075 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.871
Epoch: [1][6/10000] Time 0.836 Err 6.9064 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][7/10000] Time 0.833 Err 6.9077 Top1-%: 0.00 LR 1e-02 DataLoadingTime 2.776

-nGPU 4 -batchSize 512
Epoch: [1][2/10000] Time 3.915 Err 6.9070 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][3/10000] Time 4.449 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 11.843
Epoch: [1][4/10000] Time 3.906 Err 6.9079 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.005
Epoch: [1][5/10000] Time 3.898 Err 6.9078 Top1-%: 0.20 LR 1e-02 DataLoadingTime 7.108
Epoch: [1][6/10000] Time 3.902 Err 6.9079 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.005

-nGPU 8 -batchSize 1024
Epoch: [1][2/10000] Time 7.186 Err 6.9080 Top1-%: 0.20 LR 1e-02 DataLoadingTime 0.006
Epoch: [1][3/10000] Time 6.947 Err 6.9079 Top1-%: 0.10 LR 1e-02 DataLoadingTime 24.149
Epoch: [1][4/10000] Time 6.724 Err 6.9080 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.007
Epoch: [1][5/10000] Time 7.773 Err 6.9080 Top1-%: 0.10 LR 1e-02 DataLoadingTime 16.892
Epoch: [1][6/10000] Time 6.731 Err 6.9081 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.007
Thank you.
Chien-Lin