Closed: Suraj6198 closed this issue 2 years ago.
Unfortunately, training a speech recognition model takes a lot of memory, and transducer models in particular need even more.
Hi @sooftware, thanks for your reply. Do you have an idea of which GPU would be able to handle this Transformer Transducer model?
Of course, the bigger the better. If several A100s are possible, that would be good.
Or it would be a good idea to reduce the number of MFCC coefficients and the vocab size. Transducers use a lot of memory depending on the vocab size.
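As a rough illustration of how the joint output scales with the vocab size (the batch size, frame count, and label length below are assumed values, not figures from this thread):

```python
# Back-of-the-envelope estimate of the RNN-T joint logits tensor.
# All values here are illustrative assumptions.
batch_size = 8          # B, assumed batch size
enc_frames = 512        # T, acoustic time steps
label_len = 128         # U, target label length
vocab_size = 21_800     # V, vocabulary size

# The joint network produces logits of shape (B, T, U, V) in float32 (4 bytes each).
logits_bytes = batch_size * enc_frames * label_len * vocab_size * 4
print(f"Joint logits alone: {logits_bytes / 1024**3:.1f} GiB")
# ~42.6 GiB with V = 21,800; still ~9.8 GiB even with V = 5,000.
```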
Yes, reducing the tensor dimensions in various ways will work. But the thing is, I'm trying to achieve the same WER as reported in the paper, and reducing dimensions may affect the results.
Then, I think you have no choice but to use a lot of GPUs. 😂
I tried to train a Transformer Transducer model on the LibriSpeech "train-clean-100" dataset on a 16 GB GPU, but I'm getting a "CUDA out of memory" error. I also tried splitting various layers across 3 GPUs of 16 GB each, but I get the same error. The error points to the ''joint'' layer, probably because of the large tensors produced there.
Details:
- Number of MFCCs = 128
- Timesteps = 512
- Vocabulary size = 21800 (tried reducing it to 5K, but got the same error)
- Embedding layer dimension = Vocab_size * 512
- Audio Encoder = TransformerTransducerEncoder
- Label Encoder = TransformerTransducerDecoder
- Loss = RNNTLoss
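For context, here is a minimal sketch of the kind of joint computation an RNN-T loss requires; the module and dimensions below are illustrative assumptions, not the exact code from this repository:

```python
import torch
import torch.nn as nn

class JointNet(nn.Module):
    """Minimal RNN-T joint network sketch (illustrative only)."""
    def __init__(self, enc_dim: int, dec_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.fc = nn.Linear(enc_dim + dec_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, enc_out: torch.Tensor, dec_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (B, T, enc_dim) from the audio encoder
        # dec_out: (B, U, dec_dim) from the label encoder
        enc = enc_out.unsqueeze(2).expand(-1, -1, dec_out.size(1), -1)  # (B, T, U, enc_dim)
        dec = dec_out.unsqueeze(1).expand(-1, enc_out.size(1), -1, -1)  # (B, T, U, dec_dim)
        joint = torch.cat([enc, dec], dim=-1)                           # (B, T, U, enc_dim + dec_dim)
        return self.out(torch.tanh(self.fc(joint)))                     # (B, T, U, vocab_size)
```

Because the logits have shape (B, T, U, V), memory grows linearly in batch size, encoder frames, label length, and vocabulary size, which is why the OOM points at the joint layer.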
If anyone trained Transformer Transducer successfully and able to get results comparable with https://arxiv.org/abs/2002.02562 , please let me know the number of accelerators and their respective memory capacity.