Closed dpny518 closed 6 years ago
Your training data (Common Voice or LibriSpeech) consists of sentences, i.e. sequences of words, whereas the Speech Commands Dataset used here consists of single words. If you look at the audio waveforms, the silence gaps between words are relatively long and easy to detect. I would expect an RNN to learn fairly quickly to treat a "silence gap" as a "space" in a sentence. Alternatively (and probably better), an attention layer would quickly learn to focus on single words. So in theory, training on sequences of words is not much different from training on single words, but in practice it might take much longer.
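To make the "silence gap" idea concrete, here is a minimal sketch of an energy-based gap detector. It is only an illustration of why the gaps are easy to detect, not what the RNN does internally. The function name, frame length, threshold and minimum gap length are all assumptions chosen for the synthetic signal below and would need tuning on real recordings.

```python
import math

def find_silence_gaps(samples, frame_len=160, threshold=0.01, min_gap_frames=5):
    """Return (start_frame, end_frame) pairs of low-energy runs, end exclusive."""
    # Root-mean-square energy of each non-overlapping frame.
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))

    gaps = []
    run_start = None
    for i, energy in enumerate(energies):
        if energy < threshold:
            if run_start is None:
                run_start = i  # a quiet run begins here
        else:
            # A loud frame ends the run; keep it only if it was long enough.
            if run_start is not None and i - run_start >= min_gap_frames:
                gaps.append((run_start, i))
            run_start = None
    # Handle silence that lasts until the end of the signal.
    if run_start is not None and len(energies) - run_start >= min_gap_frames:
        gaps.append((run_start, len(energies)))
    return gaps

def tone(n, freq=440.0, rate=16000):
    """Synthetic stand-in for a spoken word: a plain sine wave."""
    return [0.5 * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

# "word", silence, "word" (0.1 s each at an assumed 16 kHz sample rate)
signal = tone(1600) + [0.0] * 1600 + tone(1600)
gaps = find_silence_gaps(signal)
print(gaps)
```

Real speech has background noise, so a fixed threshold like this is fragile; it only shows that long pauses stand out clearly in the frame energies.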
You will need to increase the model size, especially the RNN size, e.g. by adding extra RNN layers or an attention layer. Start with short sentences and gradually increase the sentence length.
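The "start short, then go longer" schedule could be sketched roughly like this. The `(audio_path, transcript)` tuple layout, the file names, the function name and the word limits are all made up for illustration; they are not taken from this repo.

```python
def curriculum_stages(utterances, max_words_per_stage=(3, 6, 12)):
    """Build cumulative training subsets with an increasing sentence-length cap.

    utterances: list of (audio_path, transcript) pairs (hypothetical layout).
    Returns a list of (word_limit, subset) pairs, one per stage.
    """
    stages = []
    for limit in max_words_per_stage:
        # Each stage keeps every utterance at or below the current word limit,
        # so later stages still contain the short sentences from earlier ones.
        subset = [u for u in utterances if len(u[1].split()) <= limit]
        stages.append((limit, subset))
    return stages

# Hypothetical examples, only to show the shape of the data.
utterances = [
    ("a.wav", "go left"),
    ("b.wav", "turn the light off now"),
    ("c.wav", "please follow me down the hall to the right"),
    ("d.wav", "stop"),
]

stages = curriculum_stages(utterances)
for limit, subset in stages:
    print(limit, [path for path, _ in subset])
```

You would train for a while on each stage before moving to the next; how long to stay on each stage is something to find out by experiment.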
The key to recognising new words is for the model to learn a good sound-to-character(s) mapping. You could experiment with the granularity of the mapping (e.g. syllables or phonemes), depending on the size of your data.
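One quick way to compare granularities is to check whether every output unit of a new word already occurs somewhere in the training vocabulary. The sketch below does that for characters and for phonemes. `TOY_LEXICON` is a tiny hand-written ARPAbet-style table used purely for illustration (a real setup would need a full pronunciation dictionary), and the function names are my own.

```python
def char_units(word):
    """Character-level units: one target symbol per letter."""
    return list(word.lower())

# Hand-written toy pronunciations (assumption, for illustration only).
TOY_LEXICON = {
    "nine": ["N", "AY", "N"],
    "right": ["R", "AY", "T"],
    "night": ["N", "AY", "T"],
}

def phoneme_units(word):
    """Phoneme-level units looked up in the toy lexicon above."""
    return TOY_LEXICON[word.lower()]

def unseen_units(train_words, new_word, to_units):
    """List the units of new_word that never occur in train_words."""
    seen = set()
    for word in train_words:
        seen.update(to_units(word))
    return [unit for unit in to_units(new_word) if unit not in seen]

train_words = ["nine", "right"]
missing_chars = unseen_units(train_words, "night", char_units)
missing_phonemes = unseen_units(train_words, "night", phoneme_units)
print(missing_chars, missing_phonemes)
```

An empty list only means the units have been seen, not that the model will get the word right, but a non-empty list is a strong hint that the new word cannot be spelled out from what was learned.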
Good luck:-)
If I train on, for example, Common Voice or LibriSpeech data (lots of data, but not the exact same phrases repeated over and over), how would I do keyword spotting for a new word like huschen?
"Generalisation. The c-model is able to learn unseen words, e.g. recognizing 'night' from learning 'NIne' and 'rIGHT', recognizing 'follow' as 'foow' (missing 'l' sound) from learning 'Four' 'dOWn' and 'nO'"
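Since the character output for an unseen word can be slightly off (like 'foow' for 'follow'), one possible way to spot a new keyword is fuzzy matching on the decoded text. This is a sketch under assumptions, not something this repo provides: the decoded string is assumed to come from your own model, and `spot_keyword` and the 0.4 threshold are hypothetical choices.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

def spot_keyword(decoded_text, keyword, max_ratio=0.4):
    """Return (word, ratio) for decoded words close enough to the keyword."""
    hits = []
    for word in decoded_text.lower().split():
        # Normalise by the longer length so short and long words compare fairly.
        ratio = edit_distance(word, keyword) / max(len(word), len(keyword))
        if ratio <= max_ratio:
            hits.append((word, round(ratio, 2)))
    return hits

# Pretend model output in which "follow" came out as "foow".
hits = spot_keyword("please foow me right now", "follow")
print(hits)
```

A looser threshold catches more garbled spellings but also produces more false alarms, so the cut-off would have to be tuned on held-out recordings of the keyword.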