Closed: Tomas1337 closed this issue 4 years ago
@Tomas1337
Correct me if I'm wrong, but when you convert the audio into MFCC, you use the whole audio to construct the MFCC and then randomly sample a window of NUM_FRAMES frames.
Yes that's correct. We take a few seconds of this sound.
There's no magical number but I guess at least one to two seconds of speech should be enough to create a real-time voice identification system. The more the better of course ;)
Just to update you, it seems 2 seconds works quite well! With 1 second it's not accurate. Hope to share my code soon.
@Tomas1337 thanks for sharing! Very helpful. I think my implementation has 1.6 seconds from what I remember.
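For anyone relating the durations above to NUM_FRAMES, here is a rough sketch of the conversion. It assumes one feature frame every 10 ms (the hop size is an assumption, it is not stated in this thread) and ignores edge effects from the analysis window. The helper names are hypothetical.

```python
def frames_for_duration(seconds, hop_seconds=0.01):
    """Approximate number of feature frames covering `seconds` of audio.

    hop_seconds=0.01 (10 ms between frames) is an assumed default.
    """
    return int(round(seconds / hop_seconds))


def duration_for_frames(num_frames, hop_seconds=0.01):
    """Approximate audio duration in seconds spanned by `num_frames` frames."""
    return num_frames * hop_seconds


if __name__ == "__main__":
    for seconds in (1.0, 1.6, 2.0):
        print(seconds, "s ->", frames_for_duration(seconds), "frames")
```

Under that assumption, 1.6 seconds corresponds to 160 frames and 2 seconds to 200 frames.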
Hi,
I'm trying to use your model to create a real-time voice identification system.
Correct me if I'm wrong, but when you convert the audio into MFCC, you use the whole audio to construct the MFCC and then randomly sample a window of NUM_FRAMES frames.
I'm now investigating how much audio I should pass into the fbank conversion. I haven't done extensive testing yet, but in an initial trial, 50,000 frames passed to the fbank() function worked well. This figure was pretty much a shot in the dark.
Would you have any advice as to the minimum required audio length?
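To put the 50,000 figure in terms of seconds, and to illustrate the random sampling described above, here is a minimal sketch. It assumes the 50,000 figure counts raw audio samples and that the audio is sampled at 16 kHz; neither is confirmed in this thread. The function names are hypothetical, and `random_crop` is only an illustration of taking a contiguous window of a fixed number of frames, not the project's actual code.

```python
import random


def samples_to_seconds(num_samples, sample_rate=16000):
    """Duration in seconds of `num_samples` samples (16 kHz is an assumed rate)."""
    return num_samples / sample_rate


def seconds_to_samples(seconds, sample_rate=16000):
    """Number of samples needed to cover `seconds` of audio."""
    return int(round(seconds * sample_rate))


def random_crop(frames, num_frames, rng=random):
    """Return a random contiguous window of `num_frames` items from `frames`.

    `frames` is any sequence of per-frame feature vectors.
    """
    if len(frames) < num_frames:
        raise ValueError("not enough frames to crop from")
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, len(frames) - num_frames)
    return frames[start:start + num_frames]


if __name__ == "__main__":
    print(samples_to_seconds(50000))   # 3.125 at the assumed 16 kHz
    print(seconds_to_samples(2.0))     # 32000 at the assumed 16 kHz
    window = random_crop(list(range(300)), 160)
    print(len(window))                 # 160
```

At the assumed 16 kHz, 50,000 samples is a little over 3 seconds of audio, while 2 seconds would be 32,000 samples.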