k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

GPU recommendation #342

Open armusc opened 2 years ago

armusc commented 2 years ago

Hi

I have not seen a GPU recommendation thread or a README on the subject (there might be one somewhere?). What kind of GPU would you recommend for normal trainings, i.e. moderately sized acoustic models? I'm currently using 2 Nvidia GeForce RTX 3090 with 24GB of memory; with Kaldi I can do with less than 5GB. Number of GPUs and memory size are the two critical parameters here.

For a few thousand hours of speech (given that multi-style augmentation is desirable), I would need more than a day for one epoch with 2 GPUs. I'm using a ~20M parameter model size to better compare with similarly large Kaldi chain models, and I often get close to consuming all the GPU memory (I use an encoder-decoder CTC-attention model with conformer layers).

GeForce RTX 3090 cards have a more reasonable price than Tesla V100 32GB cards, which are way too expensive. It looks to me like the latest SOTA in speech recognition comes with a greater cost in computational resources, which is expected, but I wonder whether (very) good old Kaldi is good enough for my pockets...

danpovey commented 2 years ago

Our GPUs have 32GB, but I think 24GB is close enough. You might have to decrease the --max-duration a bit. In our most recent work with the pruned_transducer_stateless2 setup, we have been running with --use-fp16=True, i.e. with half-precision floats, which makes it use less memory (or alternatively you can use a larger --max-duration). Definitely 5GB is not enough.
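For reference, an invocation along these lines might look roughly like the following sketch; the script path, --world-size, --exp-dir and --max-duration values are only illustrative placeholders (defaults differ between icefall versions), so adjust them for your own GPUs:

    cd egs/librispeech/ASR
    # 2 GPUs, half precision; lower --max-duration if you hit CUDA OOM
    ./pruned_transducer_stateless2/train.py \
      --world-size 2 \
      --use-fp16=True \
      --max-duration 300 \
      --exp-dir pruned_transducer_stateless2/exp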

entn-at commented 2 years ago

Just to add a data point: when training the librispeech pruned_transducer_stateless5 setup on 4x RTX 2080 Ti (11GB) with --use-fp16=True, the largest --max-duration I can use is 150.

danpovey commented 2 years ago

OK. It's a shame that it can't use a larger duration, but I estimate that it won't affect the training, or if it does, the effect might be slightly positive (i.e. good) WER-wise. Also, the speed difference from that kind of batch-size change is not huge.

funboarder13920 commented 2 years ago

Hello, on librispeech (streaming rnn-t emformer), the model does not converge when using a small max-duration (under 100), even with smaller learning rates. I get the same behavior when training on my homemade dataset.

[image: training loss curves; red: max-duration 100, blue: max-duration 60]

I would advise staying above 100 for max-duration, and trying to increase it if the model does not seem to converge, since other parameters didn't seem to help in a too-small batch-size scenario.

armusc commented 2 years ago

With the encoder-decoder conformer CTC-attention loss, I have to use a 35-second max duration per batch. It's an internal dataset (765h * 8 augmented copies). The model does converge, though it takes ~34 hours for each epoch on 2 RTX 3090s.

danpovey commented 2 years ago

You may have to reduce the initial learning rate if it is not converging.

danpovey commented 2 years ago

If it's still not converging, try doubling model_warm_step from 3k to 6k. I advise not reducing the initial learning rate by more than a factor of about sqrt(2).
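For concreteness (the 0.003 starting value below is only an illustrative example, not necessarily your recipe's default): reducing the initial learning rate by a factor of sqrt(2) means going from, say, 0.003 to about 0.003 / 1.41 ≈ 0.0021, while the warm-up change is model_warm_step: 3000 -> 6000.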

danpovey commented 2 years ago

.. you don't have to leave it running very long to tell if it's converging. It should get a good loss within model_warm_step, or not at all.
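One practical way to check this (assuming the recipe's usual --tensorboard True default and exp-dir layout, which may vary between recipes) is to point tensorboard at the experiment directory, e.g.

    tensorboard --logdir <exp-dir>/tensorboard

and see whether the total loss has dropped to a reasonable value within the first few thousand batches.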