mravanelli / pytorch-kaldi

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

Training on multi-gpu very slow #237

Open · sun-peach opened this issue 4 years ago

sun-peach commented 4 years ago

I am training my ASR model with pytorch-kaldi and notice that training is very slow: processing 10% of one chunk takes 10 minutes, i.e. about 100 minutes per chunk. With 10 chunks per epoch and 15 epochs, that works out to roughly 10 days of training.

My dataset has about 2k hours of audio, which I split into 10 chunks. I use multi-GPU training, and each GPU has 32 GB of memory. I am following cfg/librispeech_liGRU_fmllr.cfg, except that I use Adam as the optimizer and 4 liGRU layers (instead of the 5 layers set originally).

I have searched the existing issues and learned that the developers have already optimized the multi-GPU training process. However, my GPU utilization is still only around 30%, so the GPUs are not fully used. Is there any way I can speed up the training?

Thank you very much!
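(As a generic way to check where a low utilization figure like 30% comes from, one can time the data-loading and GPU portions of a few training steps; this is not pytorch-kaldi code, and `profile_steps`, `loader`, and the other names below are just hypothetical placeholders for illustration. A large data/transfer share suggests the input pipeline, rather than the model, is the bottleneck.)

```python
import time
import torch

# Generic timing harness (not pytorch-kaldi's loop): split each training step
# into "waiting on data + host-to-device copy" and "GPU compute" time.
def profile_steps(loader, model, loss_fn, optimizer, n_steps=50):
    data_t, gpu_t = 0.0, 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        x, y = next(it)                                  # time spent waiting on the loader
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()                         # wait for GPU work to finish
        t2 = time.perf_counter()
        data_t += t1 - t0
        gpu_t += t2 - t1
    print(f"data/transfer: {data_t:.1f}s  compute: {gpu_t:.1f}s")
```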

TParcollet commented 4 years ago

Hi! So this is quite a hard problem in itself. 2k hours is a lot, and 10 days of training on a single GPU sounds reasonable to me. You can:

1. Consider something other than the liGRU (LSTM and GRU are faster thanks to cuDNN, but they also give worse performance).
2. Go multi-GPU. DataParallel is bottlenecked by Python, and the only real alternative is DistributedDataParallel (which I think is impossible to adapt to pytorch-kaldi). So just set multi_gpu=True and then use batch_size = max_batch_size_for_one_gpu * number_of_gpus. Training time doesn't scale linearly with the number of GPUs, but you can easily get down to about 3 days with 4 GPUs.
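A minimal standalone sketch of the DataParallel pattern described in point 2, assuming a placeholder recurrent model and toy sizes (this is not pytorch-kaldi's actual training code): nn.DataParallel splits the input batch along dim 0 across the visible GPUs, which is why the batch size should be scaled by the number of GPUs.

```python
import torch
import torch.nn as nn

per_gpu_batch = 16                            # largest batch that fits on one GPU (assumption)
n_gpus = max(torch.cuda.device_count(), 1)
batch_size = per_gpu_batch * n_gpus

# Placeholder recurrent model standing in for the liGRU stack.
model = nn.GRU(input_size=40, hidden_size=550, num_layers=4, batch_first=True)
if torch.cuda.is_available():
    model = model.cuda()
    if n_gpus > 1:
        model = nn.DataParallel(model)        # replicate the model, scatter the batch

x = torch.randn(batch_size, 300, 40)          # (batch, time, features), toy data
if torch.cuda.is_available():
    x = x.cuda()
output, _ = model(x)                          # each GPU sees ~batch_size / n_gpus sequences
print(output.shape)                           # (batch_size, 300, 550)
```

DistributedDataParallel avoids the single-process Python and gather overheads of DataParallel, which is why scaling with DataParallel stays sub-linear.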

sun-peach commented 4 years ago

Thank you. I use the settings listed below:

use_cuda=True
multi_gpu=True
N_epochs_tr=15
N_chunks=50
batch_size_train=16
max_seq_length_train=1500
increase_seq_length_train=True
start_seq_len_train=300
multply_factor_seq_len_train=5
batch_size_valid=8
max_seq_length_valid=1400

It seems it will take about 12 days (my sequences are long). If you think all my settings are reasonable, then I will just wait.
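For what it's worth, my reading of the sequence-length curriculum options above is that training starts at start_seq_len_train and the length cap is multiplied by multply_factor_seq_len_train after each epoch, up to max_seq_length_train; this is an assumption worth checking against run_exp.py, not the project's actual code. With a factor of 5, the cap already reaches 1500 after the first epoch, so nearly all epochs run on full-length sequences, which keeps the per-epoch time high:

```python
# Assumed behaviour of the curriculum options (not pytorch-kaldi's actual code).
max_seq_length_train = 1500
start_seq_len_train = 300
multply_factor_seq_len_train = 5
N_epochs_tr = 15

seq_len = start_seq_len_train
for epoch in range(1, N_epochs_tr + 1):
    print(f"epoch {epoch}: max training sequence length = {seq_len}")
    seq_len = min(seq_len * multply_factor_seq_len_train, max_seq_length_train)
# prints 300 for epoch 1 and 1500 for every later epoch
```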

TParcollet commented 4 years ago

How many GPUs do you have?

sun-peach commented 4 years ago

4 GPUs.