stefan-falk opened this issue 6 years ago
This is expected: T2T uses synchronous training, so one step with 4 GPUs trains on an effective batch size that is 4 times larger. See e.g. this paper.
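To illustrate what that means, here is a toy numpy sketch of synchronous data-parallel training (just an illustration of the idea, not T2T's implementation; the 4 GPUs and per-GPU batch size of 16 are the numbers from this thread):

```python
# Toy sketch of synchronous data-parallel training (NOT T2T's actual code):
# every replica computes gradients on its own per-GPU batch, the gradients are
# averaged, and a single update is applied. One global step therefore consumes
# num_gpus * per_gpu_batch examples.
import numpy as np

num_gpus, per_gpu_batch = 4, 16
w = np.zeros(3)                                       # toy parameter vector
shards = np.random.randn(num_gpus, per_gpu_batch, 3)  # one data shard per GPU

# gradient of the toy loss 0.5 * ||x - w||^2, averaged within each shard
grads = [np.mean(w - shard, axis=0) for shard in shards]
w -= 0.1 * np.mean(grads, axis=0)                     # one step over 4 * 16 = 64 examples
```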
@martinpopel Thank you for the link. What I take from it is that I should compare the effective throughput, i.e. samples per second, is that right?
If I take my example from above, this would mean:
Single: 9 gstep/s * 16 bsize = 144 samples/s
Multi: 2 gstep/s * 16 bsize * 4 #gpu = 128 samples/s
So it would seem that the additional overhead effectively slows down the training for this dataset... am I getting something wrong?
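For completeness, the same arithmetic as a tiny Python sketch (step rates and batch size are the values reported above, nothing measured separately):

```python
# Convert the logged global_step/sec into samples/sec, using the numbers above.
def samples_per_sec(steps_per_sec, per_gpu_batch, num_gpus):
    # In synchronous training every GPU consumes per_gpu_batch examples per step.
    return steps_per_sec * per_gpu_batch * num_gpus

single_gpu = samples_per_sec(9, 16, 1)  # 144 samples/s
multi_gpu = samples_per_sec(2, 16, 4)   # 128 samples/s
print(single_gpu, multi_gpu)            # the 4-GPU run is ~11% slower in raw throughput
```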
Yes, it seems the training throughput with 4 GPUs is lower than with a single GPU in your case. This is strange; usually it is higher (a sublinear speedup, but still a speedup). Maybe the interconnect between your GPUs is slow (e.g. PCIe only, no NVLink). I am also not familiar with Librispeech. However, the larger effective batch size may still lead to faster convergence on the dev set, as explained in the paper I linked.
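One quick way to check the interconnect (a hedged sketch, assuming the nvidia-smi utility is installed and on PATH):

```python
# Print the GPU-to-GPU topology matrix via nvidia-smi. The matrix shows whether
# GPU pairs are connected through NVLink (NV#) or only through PCIe / host
# bridges (PIX, PHB, SYS), which is much slower for gradient exchange.
import subprocess

result = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
print(result.stdout)
```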
@martinpopel In the meantime I ran another small experiment: training from scratch, one GPU reached a loss of 1.7 after 2 minutes, while 4 GPUs needed 9 minutes to reach the same loss. That's not a very scientific test, but it seems that multi-GPU does not help me here, for whatever reason. I'll look into the interconnect question just to be sure.
any update on using multi-GPUs?
@lkluo What I can say, based on my observations, is that 4 GPUs let the model converge faster and to a better final result. I assume this is due to the larger effective batch size. From my experience I'd recommend using multiple GPUs and larger batch sizes.
How long did it take you to reach SOTA on 4 GPUs?
same problems
Description
I am training a Transformer model on the Librispeech dataset using 4 GPUs and 8 CPU cores. I have tested the following:
Single-GPU
Multi-GPU
Both scripts are working. The training starts and on the surface everything looks okay. However, I am getting a global_step/sec of just ~2 for Multi-GPU, compared to ~9 for Single-GPU. Shouldn't I see a speedup using multiple GPUs? If so: what might be the problem here? Can I trust the log output?
Environment information