jefflai108 / Contrastive-Predictive-Coding-PyTorch

Contrastive Predictive Coding for Automatic Speaker Verification
MIT License

How to combine MFCC and CPC features #3

Closed cyl250 closed 5 years ago

cyl250 commented 5 years ago

Thank you for sharing your code. I have run into a problem: when we use CPC, the features are [128, 256], but MFCC is [frame, 39]. Following your results, I wonder how to combine them into [frame, 39 + 256] dims. Thanks again.

jefflai108 commented 5 years ago

hi @cyl250 It is common to combine features by simply concatenating them (along the feature dimension).

The CPC feature is [num_frames, 256] and MFCC is [num_frames, 39]. Concatenating them gives [num_frames, 256+39].
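A minimal sketch of the concatenation with NumPy (random arrays stand in for the actual features; the array names here are placeholders, not identifiers from this repo):

```python
import numpy as np

num_frames = 500  # example utterance length in frames

cpc = np.random.randn(num_frames, 256)   # CPC features: [num_frames, 256]
mfcc = np.random.randn(num_frames, 39)   # MFCC features: [num_frames, 39]

# Concatenate along the feature dimension (axis=1)
combined = np.concatenate([cpc, mfcc], axis=1)
print(combined.shape)  # (500, 295)
```

The same works with `torch.cat([cpc, mfcc], dim=1)` on tensors. Note that both feature streams must be aligned to the same number of frames before concatenating.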

cyl250 commented 5 years ago

Sorry, I have some trouble understanding. After model.predict(), the CPC features are in [128, 256] dims. Do I need to change the number of nodes in the network so that model.predict() returns [num_frames, 256] vectors?

jefflai108 commented 5 years ago

128 is the number of frames during TRAINING. In CPC training, random chunks of the raw waveform are selected and fed to the encoder. For example, a random chunk of 20480 samples corresponds to 1.28 seconds, or 128 frames (16 kHz audio).

During inference, you should input the entire utterance instead of the chunks. This will give you the correct number of frames instead of 128.
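The arithmetic above can be sketched as follows, assuming the standard CPC encoder downsampling factor of 160 samples per frame (the product of the conv strides 5·4·2·2·2 from the original CPC paper):

```python
sample_rate = 16000   # 16 kHz audio
downsample = 160      # samples per CPC frame (assumed encoder stride product)

# Training: fixed random chunk
chunk = 20480
print(chunk / sample_rate)    # 1.28 seconds
print(chunk // downsample)    # 128 frames

# Inference: the whole utterance determines the frame count
utt_samples = 48000           # e.g. a 3-second utterance
print(utt_samples // downsample)  # 300 frames -> CPC output [300, 256]
```

So no architecture change is needed: the encoder is convolutional, and the number of output frames simply scales with the input length.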

cyl250 commented 5 years ago

Thank you very much, I got it. It helps a lot. Thank you again.
