resemble-ai / Resemblyzer

A python package to analyze and compare voices with deep learning
Apache License 2.0

How many audio utterances per speaker to get a good recognition? #54

Open ADD-eNavarro opened 3 years ago

ADD-eNavarro commented 3 years ago

Hello, and first things first: thank you so much for such great work.

I have been testing your tool with a set of 28 voices, to evaluate it for use in a voice recognition system. Btw, it works great. However, after a few tests it is still not clear to me how many audio files per speaker, and of what length, are needed to ensure a good result when identifying the speaker.

Since the production system will have to deal with quite short commands (2-3 words), I tried demo1 with 3 short audio clips. The results aren't very good:

[Image: Comparison, 3 short clips]

Then, wondering whether longer audio would do better, I used 3 long clips (20-25 words). If I read the results correctly, the improvement shows up in speaker identification but not so much at the utterance level.

[Image: Comparison, 3 long clips]

I also tried using more audio (8 short clips plus the 3 long ones), which is already better:

[Image: Voice comparison]

The question is: how many audio clips, and of what kind, would you recommend to get good per-utterance results (given that false positives must be avoided)?
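
To make the comparison concrete, here is a minimal sketch of the per-utterance check I have in mind, assuming each speaker is represented by the normalized average of their enrollment utterance embeddings and each new utterance is matched by cosine similarity. The helper names and the 0.75 threshold are hypothetical, chosen only for illustration; the real embeddings would come from the encoder, and the threshold would have to be tuned on held-out recordings to keep false positives low.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def speaker_embedding(utterance_embeds):
    """Average the unit-length utterance embeddings, then re-normalize."""
    normed = [l2_normalize(e) for e in utterance_embeds]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def identify(utterance_embed, enrolled, threshold=0.75):
    """Return (speaker, score); speaker is None if no score reaches the threshold.

    `threshold` is a hypothetical value, not one recommended by the library.
    """
    best_name, best_score = None, -1.0
    for name, embed in enrolled.items():
        score = cosine_similarity(utterance_embed, embed)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # reject rather than risk a false positive
    return best_name, best_score
```

Rejecting anything below the threshold trades some missed identifications for fewer false positives, which is the side I need to favor.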

Bonus question: since my company works with .NET Core, I have exported your pretrained model to ONNX and now need to implement the audio preprocessing that feeds it. Could you recommend any code for the preprocessing part?
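
For reference, this is a rough sketch of the part of the preprocessing I believe I need to port, written in Python for readability. The constants are assumptions based on my reading of the package's hyperparameters (16 kHz audio, 25 ms window, 10 ms step, -30 dBFS volume target) and should be checked against the repository. Silence trimming and the mel filterbank itself are not shown.

```python
import math

# Assumed values; verify against the package's hparams before relying on them.
SAMPLING_RATE = 16000   # Hz
MEL_WINDOW_MS = 25      # analysis window length
MEL_STEP_MS = 10        # hop between windows
TARGET_DBFS = -30.0     # volume normalization target

def stft_params(sampling_rate=SAMPLING_RATE,
                window_ms=MEL_WINDOW_MS, step_ms=MEL_STEP_MS):
    """Convert window and step durations to sample counts (n_fft, hop_length)."""
    n_fft = sampling_rate * window_ms // 1000
    hop_length = sampling_rate * step_ms // 1000
    return n_fft, hop_length

def dbfs(samples):
    """Mean power of float samples in [-1, 1], in dB relative to full scale."""
    mean_power = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_power)

def normalize_volume(samples, target_dbfs=TARGET_DBFS, increase_only=True):
    """Apply a constant gain so the clip reaches target_dbfs.

    With increase_only=True, clips already louder than the target are left as-is.
    """
    if not samples or all(s == 0 for s in samples):
        return list(samples)  # silence or empty input: nothing to scale
    change = target_dbfs - dbfs(samples)
    if increase_only and change < 0:
        return list(samples)
    gain = 10 ** (change / 20)
    return [s * gain for s in samples]
```

The same arithmetic should translate directly to C#; the open part for me is a mel spectrogram implementation whose output matches what the model was trained on.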