jefflai108 / Contrastive-Predictive-Coding-PyTorch

Contrastive Predictive Coding for Automatic Speaker Verification

Feed entire input to encoder?? #17

Open NeteeraAF opened 3 years ago

NeteeraAF commented 3 years ago

I see in your implementation that you feed the entire signal into the encoder, while the paper notes that each timestep should be fed in separately. When you feed the entire signal into the encoder, adjacent output features are computed from overlapping input windows of the conv kernel (except in the case where the stride equals the kernel size).
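For illustration, a minimal sketch of that overlap (toy layer sizes, not this repo's actual encoder):

```python
# Minimal sketch (toy sizes, not this repo's encoder): with kernel_size > stride,
# adjacent Conv1d output frames are computed from overlapping input samples
# when the whole signal is encoded in one pass.
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=10, stride=5)

signal = torch.randn(1, 1, 100)   # (batch, channels, samples)
z = conv(signal)                  # (1, 8, 19) output frames

# Output frame t covers input samples [t*stride, t*stride + kernel_size),
# so frames 0 and 1 share samples 5..9; there is no overlap only when
# stride == kernel_size.
print(z.shape)                    # torch.Size([1, 8, 19])
```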

Why did you implement it like that? Do you think it does not matter?

Thanks!

altomanscott commented 2 years ago

I have this doubt as well. I notice that in the paper the training inputs are segmented into small chunks, with each chunk fed into the encoder to get the feature representation z_t, which is then fed into g_ar (GRU). The context c_t from g_ar is then used to predict the feature representations z_{t+k} from future time frames. I don't know if I have the correct understanding of the paper or not.
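For concreteness, a toy sketch of that chunk-wise reading (the window size and the simple linear encoder are my own placeholders, not from the paper or the repo):

```python
# Toy sketch of the chunk-wise reading: cut the waveform into
# non-overlapping windows and encode each window independently into one z_t.
import torch
import torch.nn as nn

window = 160                           # samples per chunk (placeholder value)
enc = nn.Sequential(nn.Linear(window, 256), nn.ReLU())

x = torch.randn(1, 20480)              # raw waveform
chunks = x.unfold(1, window, window)   # (1, 128, 160): step == size, so no overlap
z = enc(chunks)                        # (1, 128, 256): one z_t per chunk
```

Because each chunk is encoded in isolation here, neighbouring z_t share no input samples, which is exactly the overlap question raised above.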

In this implementation, I think the entire signal is fed into the encoder; the produced feature representations are split into two parts, the first part is fed to g_ar (GRU), and then g_ar learns to predict the second part of the representation features.
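A rough sketch of that flow as I understand it (layer sizes and names are illustrative, not copied from the repo):

```python
# Rough sketch of the whole-signal flow: encode the full waveform, split the
# latent sequence in time, run the GRU over the first part, and predict the
# second part with one linear head per future step.
import torch
import torch.nn as nn

K = 12                                     # number of future steps to predict

encoder = nn.Sequential(                   # strided conv encoder over raw audio
    nn.Conv1d(1, 256, kernel_size=10, stride=5), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=8, stride=4), nn.ReLU(),
)
g_ar = nn.GRU(input_size=256, hidden_size=256, batch_first=True)
predictors = nn.ModuleList([nn.Linear(256, 256) for _ in range(K)])  # one W_k per step

x = torch.randn(1, 1, 20480)               # raw waveform, batch of 1
z = encoder(x).transpose(1, 2)             # (batch, T, 256) latent frames

t = z.size(1) - K                          # split point in the latent sequence
c, _ = g_ar(z[:, :t])                      # context over the first part
c_t = c[:, -1]                             # context vector at the split point

preds = [w(c_t) for w in predictors]       # predictions for z_{t+1} .. z_{t+K}
# each prediction is scored against the true future z with the InfoNCE loss
```

In this reading, the targets z_{t+k} come from the same full-signal encoding, so neighbouring targets share input samples through the conv receptive fields.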

I believe these are two different models and concepts which could bring different results. I really hope that the author could elaborate on this point.

Thanks!