Generalization

Your training data (Common Voice or LibriSpeech) consists of sentences, i.e. sequences of words, whereas the Speech Commands Dataset used here contains single words. If you look at the audio waveforms, the silence gaps between the words are relatively long and easy to detect. I would expect an RNN to learn to treat a "silence gap" as a "space" in the sentence fairly quickly. Alternatively (and probably better), an attention layer would quickly learn to focus on single words. So in theory, training on sequences of words is not much different from training on single words, but in practice it might take much longer to train.
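To make the "easy to detect" point concrete, here is a minimal sketch of a frame-energy silence detector on a toy signal. It is only an illustration of how visible the gaps are, not what the RNN does internally and not code from this repo; the function name `find_silence_gaps` and the values for `frame_len`, `threshold` and `min_frames` are my own assumptions and would need tuning on real audio.

```python
def find_silence_gaps(samples, frame_len=10, threshold=1e-3, min_frames=3):
    """Return (start, end) sample index pairs for runs of low-energy frames.

    Hypothetical helper: all parameter defaults are illustrative only.
    """
    gaps = []
    run_start = None  # index of the first frame in the current silent run
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean squared amplitude
        if energy < threshold:
            if run_start is None:
                run_start = i
        else:
            # A loud frame ends the run; keep it only if it was long enough.
            if run_start is not None and i - run_start >= min_frames:
                gaps.append((run_start * frame_len, i * frame_len))
            run_start = None
    # Handle a silent run that reaches the end of the signal.
    if run_start is not None and n_frames - run_start >= min_frames:
        gaps.append((run_start * frame_len, n_frames * frame_len))
    return gaps


# Toy "two words with a pause": loud, silent, loud.
signal = [0.5] * 100 + [0.0] * 60 + [0.5] * 100
print(find_silence_gaps(signal))
```

If a plain threshold like this can find the pauses, it seems reasonable to expect a recurrent or attention layer to pick up the same cue from the data.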

You will need to increase the model size, especially the RNN size, by adding extra RNN layers or an attention layer. Start with short sentences and gradually increase the sentence length.
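The "short sentences first" idea could be sketched as a simple length-based curriculum. This assumes your examples are `(audio_path, transcript)` pairs; the function name `curriculum_stages`, the file names and the word limits below are all hypothetical, and how long to train on each stage is left to you.

```python
def curriculum_stages(examples, word_limits):
    """Yield (limit, subset) pairs, one per training stage.

    Each subset holds the examples whose transcript has at most `limit`
    words, so later stages include everything from earlier ones.
    """
    for limit in word_limits:
        subset = [ex for ex in examples if len(ex[1].split()) <= limit]
        yield limit, subset


# Hypothetical (audio_path, transcript) pairs.
examples = [
    ("a.wav", "yes"),
    ("b.wav", "turn left"),
    ("c.wav", "please turn the light off"),
    ("d.wav", "go"),
]

stages = list(curriculum_stages(examples, word_limits=[1, 2, 5]))
for limit, subset in stages:
    print(f"stage <= {limit} words: {len(subset)} examples")
    # train the model on `subset` here before moving to the next stage
```

Keeping the short examples in the later stages is a design choice; it should help the model not forget the single-word case while it learns longer sequences.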

The key to recognising new words is for the model to learn a good sound-to-character(s) mapping. You could play with the granularity of the mapping (e.g. syllables or phonemes), depending on your data size.
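A small sketch of why the output granularity matters for unseen words: with character-level targets, a word that never appeared in training can still be expressed as long as its characters did, while a word-level vocabulary simply has no label for it. The helper names `build_char_vocab` and `encode` and the toy word list are assumptions for illustration; syllable or phoneme targets would work the same way but need a pronunciation lexicon, which is not shown here.

```python
def build_char_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id."""
    chars = sorted({c for t in transcripts for c in t})
    return {c: i for i, c in enumerate(chars)}


def encode(text, vocab):
    """Turn a string into a list of target ids (raises KeyError on unseen units)."""
    return [vocab[c] for c in text]


train_words = ["stop", "go", "left"]

# Word-level targets: one label per training word.
word_vocab = {w: i for i, w in enumerate(train_words)}
# Character-level targets: one label per character.
char_vocab = build_char_vocab(train_words)

new_word = "golf"  # never seen as a whole word during training
print(new_word in word_vocab)        # no word-level label exists for it
print(encode(new_word, char_vocab))  # but every character has an id
```

With little data, a finer granularity gives the model fewer, more frequently seen targets to learn; with more data, coarser units may become practical.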

Good luck:-)

huschen / kaggle_speech_recognition

Generalization #1