joonson / syncnet_trainer

Disentangled Speech Embeddings using Cross-Modal Self-Supervision
MIT License

Negative audio samples for M-way matching #7

Open ak-7 opened 4 years ago

ak-7 commented 4 years ago

Where are the negative audio samples generated for the M-way matching problem? I can only see that the load_wav function samples the audio corresponding to the starting index of the video frame.

I only see positive samples.

joonson commented 4 years ago

The negative samples are the features at different timesteps within the same batch. In the output at this line: https://github.com/joonson/syncnet_trainer/blob/15e5cfcbe150da8ed5c04cfe74a011319ae60d06/SyncNetDist.py#L50 all off-diagonal elements are negatives.
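For illustration, here is a minimal sketch of that batch-level M-way matching idea, assuming L2-normalised audio and video embeddings; the tensor names, shapes, and dot-product similarity are assumptions for the example, not the repo's exact implementation:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of batch-level M-way matching (hypothetical names/shapes,
# not the repo's exact code). Row i of each tensor is assumed to come from
# the same timestep, so the diagonal pairs are the positives.
M, D = 8, 512                                  # M-way task, embedding dim D
audio_feat = F.normalize(torch.randn(M, D), dim=1)
video_feat = F.normalize(torch.randn(M, D), dim=1)

# (M, M) similarity matrix: entry (i, j) compares video i against audio j.
# The diagonal holds the positive pairs; every off-diagonal entry serves
# as one of the M - 1 negatives for its row.
logits = video_feat @ audio_feat.t()

# Cross-entropy over each row, with the diagonal index as the target.
targets = torch.arange(M)
loss = F.cross_entropy(logits, targets)
print(loss.item())
```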

ak-7 commented 4 years ago

Thanks for that explanation.

Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?

Also, aren't the predictions left-aligned within this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.

joonson commented 4 years ago

> Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?

Yes.

> Also, aren't the predictions left-aligned within this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.

The predictions lose two frames on both sides because of the receptive field. So, for example, you need to look at the 5th feature to see the output corresponding to the 5th-9th frames.
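To make that alignment concrete, here is a small hypothetical check of how a kernel-size-5, unpadded 1-D convolution maps output positions to input frames; the layer and shapes are illustrative only, not the repo's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical example: a 1-D conv with kernel size 5 and no padding,
# standing in for the first temporal layer discussed above.
T = 20                                    # number of input frames
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5)

x = torch.randn(1, 1, T)                  # (batch, channels, time)
y = conv(x)

print(y.shape)                            # torch.Size([1, 1, 16]), i.e. T - 4
# Output position t (0-indexed) is computed from input frames t..t+4, so
# the 5th output feature (index 4) covers the 5th-9th input frames.
```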

ak-7 commented 4 years ago

Changing the audio kernel size here messes up the model's dimensions. How did you account for the context window size and M inside the model?