philipperemy / deep-speaker

Deep Speaker: an End-to-End Neural Speaker Embedding System.
MIT License

Length of audio #61

Tomas1337 closed this issue 4 years ago

Tomas1337 commented 4 years ago

Hi,

I'm trying to use your model to create a real-time voice identification system.

Correct me if I'm wrong, but when you convert the audio into MFCC features, you use the whole audio to construct the MFCC and then randomly sample a window of size NUM_FRAMES.

I'm now investigating how much audio I should pass into the MFCC/fbank conversion. I haven't done extensive testing yet, but in an initial trial, passing 50,000 frames to the fbank() function works well. This figure was pretty much a shot in the dark.

Would you have any advice as to the minimum required audio length?

philipperemy commented 4 years ago

@Tomas1337

> Correct me if I'm wrong, but when you convert the audio into MFCC features, you use the whole audio to construct the MFCC and then randomly sample a window of size NUM_FRAMES.

Yes, that's correct. We take a few seconds of the audio.

There's no magic number, but I guess at least one to two seconds of speech should be enough to create a real-time voice identification system. The more the better, of course ;)
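The cropping step confirmed above (compute features for the whole clip, then take a random window of NUM_FRAMES) can be sketched roughly as below. This is only an illustration, not the repository's code: the function name `sample_random_window`, the error raised for short clips, and the value 160 for `NUM_FRAMES` are all assumptions.

```python
import random

NUM_FRAMES = 160  # assumed value; check constants.py in the repository


def sample_random_window(features, num_frames=NUM_FRAMES, rng=random):
    """Return a random contiguous run of `num_frames` frames.

    `features` is any sliceable sequence of frames (a list of rows or a
    2-D NumPy array of shape [total_frames, num_filters]).
    """
    total = len(features)
    if total < num_frames:
        # Hypothetical policy: refuse clips that are too short.
        raise ValueError(f"need {num_frames} frames, got {total}")
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, total - num_frames)
    return features[start:start + num_frames]
```

For a real-time setting, a fixed window (for example the most recent NUM_FRAMES frames) may be a better fit than a random one, since random cropping mainly serves as augmentation during training.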

Tomas1337 commented 4 years ago

Just to update you, it seems 2 seconds works quite well! With 1 second it's not accurate. Hope to share my code soon.
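One way to always have the latest 2 seconds of audio available in a streaming setup is a fixed-size rolling buffer. The sketch below is hypothetical and not the code referred to above: the class name `RollingAudioBuffer` and the 16 kHz sample rate are assumptions, and the actual microphone capture and embedding calls are left out.

```python
from collections import deque

SAMPLE_RATE = 16000    # assumed sample rate in Hz
WINDOW_SECONDS = 2.0   # the duration reported to work well in this thread


class RollingAudioBuffer:
    """Keeps only the most recent `seconds` of audio samples."""

    def __init__(self, sample_rate=SAMPLE_RATE, seconds=WINDOW_SECONDS):
        self.capacity = int(sample_rate * seconds)
        # A deque with maxlen silently drops the oldest samples when full.
        self._samples = deque(maxlen=self.capacity)

    def push(self, chunk):
        """Append an iterable of new samples from the audio stream."""
        self._samples.extend(chunk)

    def is_full(self):
        """True once a complete window of audio is available."""
        return len(self._samples) == self.capacity

    def window(self):
        """Return the current window as a list, oldest sample first."""
        return list(self._samples)
```

Each time a new chunk arrives, `push` it and, once `is_full()` is true, pass `window()` to the feature extraction and embedding steps.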

philipperemy commented 4 years ago

@Tomas1337 thanks for sharing! Very helpful. I think my implementation uses 1.6 seconds, from what I remember.

philipperemy commented 4 years ago

Defined here: https://github.com/philipperemy/deep-speaker/blob/ad6b66a6bdaa1cb2f1cea5b6bf02943f5e326196/constants.py#L17
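For reference, the relation between raw samples, feature frames, and seconds can be worked out with a few lines of arithmetic. The sketch below assumes a 16 kHz sample rate, a 25 ms window, and a 10 ms step, and it assumes the last partial frame is zero-padded; under that 10 ms step, 1.6 seconds corresponds to 160 frames. The function names are made up for illustration.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz
WINLEN = 0.025       # assumed analysis window length in seconds
WINSTEP = 0.01       # assumed step between successive frames in seconds


def frames_to_seconds(num_frames, winstep=WINSTEP):
    """Approximate duration covered by `num_frames` feature frames."""
    return num_frames * winstep


def samples_to_frames(num_samples, sample_rate=SAMPLE_RATE,
                      winlen=WINLEN, winstep=WINSTEP):
    """Number of frames produced from `num_samples` raw audio samples,
    assuming the final partial frame is zero-padded rather than dropped."""
    frame_len = int(round(winlen * sample_rate))
    frame_step = int(round(winstep * sample_rate))
    if num_samples <= frame_len:
        return 1
    return 1 + math.ceil((num_samples - frame_len) / frame_step)
```

Under these assumptions, 2 seconds of audio (32,000 samples) yields 199 frames. If the 50,000 figure mentioned earlier in the thread refers to raw samples, it corresponds to about 3.1 seconds, or 311 frames.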