secretsauceai opened this issue 3 years ago (status: Open)
See how we do it in OVOS: https://github.com/OpenVoiceOS/ovos-core/tree/dev/mycroft/listener
mic.py and silence.py are the relevant files.
Basically, between a noise threshold with some magic numbers and an optional VAD model, we get pretty good silence/speech detection.
Re VAD plugins: silero seems to work best according to benchmarks but is painful to install at times; webrtcvad seems to work just about anywhere, and I personally don't notice a difference in performance/accuracy.
I think you should be able to adapt silence.py and automatically get support for our plugins; if you adapt mic.py, you get support for the whole wake word stack. Those components are pretty much standalone, so you can add ovos-core to requirements.txt and import them directly.
Having better silence detection would aid in chopping up audio files containing wake word information to reduce false positives.
Currently, to make sure individual files capture only parts of the wake word recording, I chop each recording into n + 2 pieces, where n is the number of syllables in the wake word. This works, but it misses many combinations of sounds (e.g. 'Jarvis' in 'hey Jarvis' would not be completely contained in any single piece).
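For reference, a minimal sketch of the n + 2 chopping described above, assuming "chop by n + 2" means cutting the recording into n + 2 consecutive pieces of nearly equal length. The function name `chop_equal` is hypothetical and not taken from the recording script; `samples` is assumed to be any sliceable sequence of audio samples.

```python
def chop_equal(samples, n_syllables):
    """Split samples into n_syllables + 2 consecutive, nearly equal chunks.

    Hypothetical helper: illustrates the n + 2 rule, not the script's actual code.
    """
    n_chunks = n_syllables + 2
    # Chunk boundaries spread evenly over the recording, rounded to whole samples.
    bounds = [round(i * len(samples) / n_chunks) for i in range(n_chunks + 1)]
    return [samples[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]


# 'hey Jarvis' has 3 syllables, so a 1 s clip at 16 kHz becomes 5 chunks.
clip = list(range(16000))
chunks = chop_equal(clip, n_syllables=3)
```

Because the boundaries fall at fixed fractions of the clip rather than at pauses, nothing guarantees that a whole word such as 'Jarvis' lands inside one chunk, which is the limitation noted above.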
I tried some experiments with silence removal myself, based on this Stack Overflow question. However, the threshold must be provided manually, and I couldn't find a satisfactory one; perhaps a dynamic threshold is needed?
Here is an interesting code snippet to check if it works better.
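One way a dynamic threshold could look, as a rough sketch rather than a tested solution: estimate the noise floor per recording as a low quantile of the frame RMS values and set the threshold to a multiple of it. All names and defaults here (`frame_rms`, `dynamic_threshold`, `speech_segments`, the 0.1 quantile, the 3.0 ratio, 320-sample frames) are assumptions for illustration, and samples are assumed to be 16-bit integer PCM values.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of each consecutive, non-overlapping frame."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def dynamic_threshold(rms_values, floor_quantile=0.1, ratio=3.0):
    """Threshold = ratio times the noise floor (a low quantile of frame RMS)."""
    ordered = sorted(rms_values)
    noise_floor = ordered[int(floor_quantile * (len(ordered) - 1))]
    # The floor of 1.0 (assumed 16-bit scale) avoids a zero threshold
    # on clips that contain pure digital silence.
    return ratio * max(noise_floor, 1.0)


def speech_segments(samples, frame_len=320, **kwargs):
    """Return (start, end) sample ranges whose frame RMS exceeds the threshold."""
    rms = frame_rms(samples, frame_len)
    if not rms:
        return []
    threshold = dynamic_threshold(rms, **kwargs)
    segments, start = [], None
    for idx, value in enumerate(rms):
        if value > threshold and start is None:
            start = idx * frame_len               # speech begins
        elif value <= threshold and start is not None:
            segments.append((start, idx * frame_len))  # speech ends
            start = None
    if start is not None:                         # clip ends during speech
        segments.append((start, len(rms) * frame_len))
    return segments


# Synthetic clip: 10 quiet frames, 5 loud frames, 10 quiet frames.
quiet = [10, -10] * 1600
loud = [5000, -5000] * 800
print(speech_segments(quiet + loud + quiet))  # [(3200, 4800)]
```

The quantile only works as a noise estimate if at least some of the clip is background noise. On a recording that is almost entirely speech, the threshold would land too high, so combining this with a VAD is probably still the safer route.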
Solution
However, I think the easiest solution for now is to add a feature to the wake word recording Python script that lets people add such recordings themselves. This level of recording (such as using 'Jarvis' as not-wake audio) was impossible with earlier models, before the data generation methods were perfected.
This is the easiest and most viable solution, but it would be cool to be able to automatically chop up audio files into syllables and even more complex sounds in the future.
Example
I want my wake word 'hey Jarvis' to trigger, but not 'Jarvis' on its own. So, when prompted for extra not-wake-word input, I add 'Jarvis' with 2 recordings (one for training, one for testing; further data will be generated from them anyway).