We are planning to use icefall and zipformer to train large-scale ASR models, and will likely run experiments at three sizes: 600M, 1B (similar to Whisper Large), and 5B parameters. We are seeking advice on parameter configuration and settings for these models; our goal is the best possible performance. We have observed that the largest publicly available model in icefall has approximately 140M parameters, and our training data consists of several million hours of audio.
Any good suggestions for this? @danpovey @pingfengluo @nshmyrev
That's nice!
For each doubling of parameter size, I would probably:
increase the --feedforward-dim and --encoder-dim values by around a factor of sqrt(2), e.g. increase 512,768,1024,1536,1024,768 to 768,1024,1536,2048,1536,768. This is the thing we mostly change when we change the model size (there is a rough combined sketch at the end of this reply).
You may also want to increase query-head-dim and value-head-dim a bit for the larger models, and num-heads can also be increased a little, but don't increase num-heads too much, as it will increase memory requirements when sequences are long. I.e. you probably want to increase these by less than the encoder and feedforward dims.
joiner-dim and decoder-dim could perhaps be increased slightly, probably not to much more than 768, if you are using RNN-T and want a strong joiner. I also recommend increasing num-encoder-layers very slightly... it's currently 2,2,3,4,3,2... don't increase it by as much as you increase the dimensions, though.
You probably don't need to increase encoder-unmasked-dim.
You could increase pos-dim and pos-head-dim slightly as well, just on the general principle of increasing things, although this will probably make little difference.
If you are feeling brave and have time to experiment, you could also try adding a central, more-downsampled stack, e.g. change from "1,2,4,8,4,2" to "1,2,4,8,16,8,4,2". That will require changing the other comma-separated dims according to the patterns you can see. The good thing about this is that it requires very little extra memory during training but increases the parameter count by a lot. Caution: in our 1000h Librispeech setup this made the results worse, but I suspect that was overfitting.
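To make those rules of thumb concrete, here is a rough sketch of what a first step up from the current ~140M recipe might look like, assuming the librispeech zipformer recipe layout and its train.py flags. The specific numbers are illustrative guesses that follow the advice above, not a tuned or verified configuration, so check the resulting parameter count before committing to a long run:

```bash
# Illustrative scaled-up zipformer config (not tuned; parameter count unverified).
# Dims go up by roughly sqrt(2), heads/head-dims by less, layer counts slightly;
# encoder-unmasked-dim is left at the recipe default per the advice above.
./zipformer/train.py \
  --num-encoder-layers 2,2,4,5,4,2 \
  --downsampling-factor 1,2,4,8,4,2 \
  --feedforward-dim 768,1024,1536,2048,1536,768 \
  --encoder-dim 256,384,512,768,512,384 \
  --encoder-unmasked-dim 192,192,256,256,256,192 \
  --num-heads 4,4,6,8,6,4 \
  --query-head-dim 48 \
  --value-head-dim 16 \
  --pos-dim 64 \
  --pos-head-dim 6 \
  --decoder-dim 768 \
  --joiner-dim 768
  # plus the usual data and training options
```

If you do try the extra 16x-downsampled central stack, every comma-separated flag above needs 8 entries instead of 6, e.g. --downsampling-factor 1,2,4,8,16,8,4,2, with the other dims mirrored around the new middle stack.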