We are planning to use icefall and zipformer to train large-scale ASR models, and will likely run experiments at three sizes: 600M, 1B (similar to Whisper Large), and 5B parameters. We are seeking advice on parameter configuration and settings for these models; our goal is the best possible performance. We have observed that the largest publicly available model in icefall has approximately 140M parameters, and our training data consists of several million hours of audio.
Any good suggestions for this? @danpovey @pingfengluo @nshmyrev
That's nice!
For each doubling of parameter size, I would probably:
increase the --feedforward-dim and --encoder-dim values by around a factor of sqrt(2), e.g. increase 512,768,1024,1536,1024,768 to 768,1024,1536,2048,1536,768. This is the thing we mostly change when we change the model size (there is a rough combined sketch at the end of this reply).
You may also want to increase query-head-dim and value-head-dim a bit for the larger models, and num-heads can also be increased a little, but don't increase num-heads too much, as it will increase memory requirements when sequences are long. I.e. you probably want to increase these by less than the encoder and feedforward dims.
joiner-dim and decoder-dim could perhaps be increased slightly, probably not to much more than 768, if you are using RNN-T and want a strong joiner. I also recommend increasing num-encoder-layers very slightly... it's currently 2,2,3,4,3,2... don't increase it by as much as you increase the dimensions, though.
You probably don't need to increase encoder-unmasked-dim.
You could increase pos-dim and pos-head-dim slightly as well, just on the general principle of increasing things, although this will probably make little difference.
If you are feeling brave and have time to experiment, you could also try adding a central, more-downsampled stack, e.g. change from "1,2,4,8,4,2" to "1,2,4,8,16,8,4,2". That will require changing the other comma-separated dims according to the patterns you can see. The good thing about this is that it requires very little extra memory during training but increases the parameter count by a lot. Caution: in our 1000h Librispeech setup this made the results worse, but I suspect that was overfitting.
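To make those rules of thumb concrete, here is a rough sketch of what a first step up from the current ~140M recipe might look like, assuming the librispeech zipformer recipe layout and its train.py flags. The specific numbers are illustrative guesses that follow the advice above, not a tuned or verified configuration, so check the resulting parameter count before committing to a long run:

```bash
# Illustrative scaled-up zipformer config (not tuned; parameter count unverified).
# Dims go up by roughly sqrt(2), heads/head-dims by less, layer counts slightly;
# encoder-unmasked-dim is left at the recipe default per the advice above.
./zipformer/train.py \
  --num-encoder-layers 2,2,4,5,4,2 \
  --downsampling-factor 1,2,4,8,4,2 \
  --feedforward-dim 768,1024,1536,2048,1536,768 \
  --encoder-dim 256,384,512,768,512,384 \
  --encoder-unmasked-dim 192,192,256,256,256,192 \
  --num-heads 4,4,6,8,6,4 \
  --query-head-dim 48 \
  --value-head-dim 16 \
  --pos-dim 64 \
  --pos-head-dim 6 \
  --decoder-dim 768 \
  --joiner-dim 768
  # plus the usual data and training options
```

If you do try the extra 16x-downsampled central stack, every comma-separated flag above needs 8 entries instead of 6, e.g. --downsampling-factor 1,2,4,8,16,8,4,2, with the other dims mirrored around the new middle stack.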