snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

'voxlingua107' for Language Classifier #71

Closed doublex closed 3 years ago

doublex commented 3 years ago

This project has audio samples for 107 languages: http://bark.phon.ioc.ee/voxlingua107/
It would be great to use them to improve the Language Classifier.

snakers4 commented 3 years ago

Many thanks. I saw it some time ago; we are working on it, albeit with low priority.

Theoretically we will even be able to make a VAD out of it as well.

snakers4 commented 3 years ago

Done in https://github.com/snakers4/silero-vad/commit/395885b06b408b9ca0b84dcf05a42d8e8be59153
More data was used. We will probably exclude some artificial or unspoken languages and train a bigger model.

bytosaur commented 3 years ago

Wow, this is great news! I have been working on a language classifier using the Common Voice dataset, but I found it pretty hard to get a satisfying validation accuracy even on four languages. What is your validation accuracy?

I have been using 5 s samples with STFT features and classified them with the small ATTRNN used in here. Do you have a tip on how to solve this task?

snakers4 commented 3 years ago

> hard to get a satisfying validation accuracy even on four languages. What is your validation accuracy?

We had 99%+, provided the languages were quite different (en, ru, de, es). Though we just did a random split, without regard to speakers. The datasets are large enough not to care.

For 100+ languages there are still some unresolved issues, i.e. English having low accuracy and mutually intelligible languages having orders-of-magnitude differences in available data.

> Do you have a tip on how to solve this task?

Just use our models. If you need higher quality for some particular cases, please DM for commercial inquiries.

bytosaur commented 3 years ago

Thanks for the quick reply! When I did not care about speakers, my accuracy was about 95%, but the model failed hard in a real-life scenario. After I fixed the speakers issue (no speaker shared between training and validation), accuracy dropped drastically (to 85%), but real-life performance is almost OK now. I used 30k samples per language. I noticed there are flawed samples from cutting with audio activity detection (auditok), so I looked for a VAD and ended up here. Great work! I'll try to use your VAD for cleaner cuts on the samples.
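
For reference, a minimal sketch of what a speaker-disjoint split can look like. This is not code from this thread or from silero-vad. The record layout (dicts with `path`, `language`, `speaker` keys) and the function name `speaker_disjoint_split` are assumptions made only for illustration. The idea is to assign whole speakers, not individual clips, to either the training or the validation set, per language.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(samples, val_fraction=0.2, seed=0):
    """Split clip records so that no speaker appears in both sets.

    `samples` is a list of dicts with hypothetical keys
    "path", "language" and "speaker" (illustrative layout only).
    """
    # Group clips by (language, speaker) so a speaker always moves as a unit.
    clips = defaultdict(list)
    for sample in samples:
        clips[(sample["language"], sample["speaker"])].append(sample)

    # Collect the speakers available for each language.
    speakers_by_lang = defaultdict(list)
    for lang, speaker in clips:
        speakers_by_lang[lang].append(speaker)

    rng = random.Random(seed)
    train, val = [], []
    for lang in sorted(speakers_by_lang):
        speakers = sorted(speakers_by_lang[lang])
        if len(speakers) < 2:
            raise ValueError(f"need at least 2 speakers for {lang!r}")
        rng.shuffle(speakers)
        # Hold out at least one speaker, but never all of them.
        n_val = max(1, round(len(speakers) * val_fraction))
        n_val = min(n_val, len(speakers) - 1)
        held_out = set(speakers[:n_val])
        for speaker in speakers:
            target = val if speaker in held_out else train
            target.extend(clips[(lang, speaker)])
    return train, val
```

Because the fraction applies to speakers rather than clips, the validation share of clips can deviate from `val_fraction` when some speakers contributed far more recordings than others.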

snakers4 commented 3 years ago

> I used 30k samples per language. After I fixed the speakers issue, acc drastically decreased (85%) but real life performance is almost OK now.

Also, in-domain vs. out-of-domain may be an issue if your dataset is not diverse enough (not enough augmentations). As for the language classifier, most likely we will update it soon; there are some obvious improvements.