ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Get the speaker id out from wave file #242

Open arpitbaheti opened 7 years ago

arpitbaheti commented 7 years ago

Hi,

Does anyone know how we can use wave-net implementation to actually return the speaker id on giving wave file as input? Instead of generating the wave file for a given speaker.

Thanks, Arpit

belevtsoff commented 7 years ago

@arpitbaheti In the original paper, DeepMind mentions using WaveNet for speech recognition. Try using that architecture for classification tasks on your raw waveforms: it's basically a mean-pooling layer and then a stack of regular conv layers on top. We tried it for F0 estimation and it worked really well.
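A minimal NumPy sketch of that kind of classification head (the shapes, pooling factor, and layer sizes here are my own illustrative assumptions, not taken from the paper or this repo): mean-pool the per-sample activations to coarser frames, apply a couple of plain convolutions, then average over time to get one score vector per utterance.

```python
import numpy as np

def mean_pool_1d(x, factor):
    """Average-pool a [time, channels] array by an integer factor."""
    t = (x.shape[0] // factor) * factor          # drop the ragged tail
    return x[:t].reshape(-1, factor, x.shape[1]).mean(axis=1)

def conv1d(x, w):
    """'Valid' 1-D convolution: x is [time, in_ch], w is [k, in_ch, out_ch]."""
    k = w.shape[0]
    out_t = x.shape[0] - k + 1
    # stack sliding windows, then contract over (kernel, in_ch)
    windows = np.stack([x[i:i + out_t] for i in range(k)], axis=1)  # [out_t, k, in_ch]
    return np.einsum('tki,kio->to', windows, w)

rng = np.random.default_rng(0)
acts = rng.standard_normal((16000, 64))          # 1 s of per-sample activations (assumed)
frames = mean_pool_1d(acts, 160)                 # 160x down-sampling -> 10 ms frames
h = np.maximum(conv1d(frames, rng.standard_normal((3, 64, 128))), 0.0)  # conv + ReLU
logits = conv1d(h, rng.standard_normal((3, 128, 10))).mean(axis=0)      # pool over time
print(logits.shape)  # one score per class for the whole utterance
```

The key point is the final mean over the time axis: it collapses the per-frame outputs into a single vector, which is what a per-utterance classifier (speaker ID, F0 class, etc.) needs.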

arpitbaheti commented 7 years ago

Thanks @belevtsoff for your answer. In the original paper they say they added "a mean pooling layer after dilated convolutions that aggregated the activations to coarser frames spanning 10 ms (160x down-sampling)". So what exactly does the average pooling do? Does it reduce the input dimension to a particular value? I have tried the same architecture: average pool1d on the skip outputs, followed by two conv1d layers, and then a softmax over the target speaker (a single integer representing the speaker id). The problem is that the sizes of the logits and the target do not match. You said you tried F0 estimation; can you please let me know how you did that?
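The logit/target mismatch described above is usually because the network emits one logit vector per frame ([num_frames, num_speakers]) while the target is a single integer per utterance. One common fix, sketched below with NumPy (the shapes and names are assumptions for illustration, not from this repo), is to pool the logits over time before the softmax so one vector is compared against one label:

```python
import numpy as np

def utterance_loss(frame_logits, speaker_id):
    """Softmax cross-entropy for one integer speaker label.

    frame_logits: [num_frames, num_speakers] per-frame scores.
    Averaging over time first makes the logits match a single target.
    """
    logits = frame_logits.mean(axis=0)              # [num_speakers]
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[speaker_id]

rng = np.random.default_rng(1)
frame_logits = rng.standard_normal((98, 10))        # e.g. 98 frames, 10 speakers
loss = utterance_loss(frame_logits, speaker_id=3)
print(loss)  # a scalar loss, not a per-frame vector
```

In TensorFlow terms this corresponds to reducing over the time axis and then feeding a [batch, num_speakers] tensor and a [batch] integer label tensor to a sparse softmax cross-entropy loss.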

haoeryue commented 7 years ago

I have the same question. Has the original poster solved the problem?

arpitbaheti commented 7 years ago

I have tried many things, but WaveNet operates per sample and we can't predict a speaker per sample. I have modified https://github.com/buriburisuri/speech-to-text-wavenet to return the speaker ID with MFCCs as input features, but it doesn't work well. Is any other network known to work for speaker recognition (any RNN/LSTM)?

haoeryue commented 7 years ago

@arpitbaheti All right. Actually, I want to try to classify or identify audio samples using time-domain features directly, rather than transformed features such as MFCCs or spectrograms. Have you tried any other methods that work directly on the waveform, like WaveNet does?