Hello, and first things first: thank you so much for such great work.
I have been testing your tool with a bunch of voices (28) to consider using it in a voice recognition system. By the way, it works great.
However, after a few tests it's still not clear to me how many audio files per speaker, and of what length, would be needed to ensure reliable speaker identification.
Since the production system will have to deal with quite short commands (2-3 words), I tried demo1 with 3 short audios. The results aren't very good:
Then, wondering whether longer audios would help, I used 3 long audios (20-25 words). If I understand the results correctly, speaker identification improves, but per-utterance results not so much.
I also tried using more audios (the 8 short ones plus the 3 long ones), which is already better:
So the question is: how many audios, and of what kind, would you recommend to get good per-utterance results? (False positives must be avoided.)
Bonus question: since my company works with .NET Core, I have exported your pretrained model to ONNX and am now facing the audio preprocessing needed to feed it. Could you recommend any code for that preprocessing step?