ak-7 opened this issue 4 years ago
The negative samples are the features at different timesteps within the same batch. In the output computed at this line:
https://github.com/joonson/syncnet_trainer/blob/15e5cfcbe150da8ed5c04cfe74a011319ae60d06/SyncNetDist.py#L50
all non-diagonal elements are negatives.
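To make that concrete, here is a minimal sketch of how a single all-pairs similarity yields the negatives for free. The tensor names, shapes, and the plain cosine-similarity scoring below are illustrative assumptions, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors: one audio and one video feature per timestep, with M
# timesteps taken from the batch (names and shapes are illustrative).
M, D = 10, 512
audio_feat = torch.randn(M, D)
video_feat = torch.randn(M, D)

# All-pairs scores: entry (i, j) compares video timestep i with audio timestep j.
logits = F.cosine_similarity(
    video_feat.unsqueeze(1),  # (M, 1, D)
    audio_feat.unsqueeze(0),  # (1, M, D)
    dim=-1,
)  # -> (M, M)

# The diagonal holds the synchronised (positive) pairs; every off-diagonal
# entry pairs audio and video from different timesteps, i.e. a negative.
loss = F.cross_entropy(logits, torch.arange(M))  # M-way matching loss
```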
Thanks for that explanation.
Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?
Also, aren't the predictions left-aligned for this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.
> Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?
Yes
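For instance, in the sketch below only the temporal kernel of 5 matters for the context window; the channel counts, stride, and input size are made-up stand-ins, not the model's actual configuration:

```python
import torch
import torch.nn as nn

# A first video conv layer with a temporal kernel of 5: each output position
# sees a 5-frame context window along the time axis.
conv_video = nn.Conv3d(3, 96, kernel_size=(5, 7, 7), stride=(1, 2, 2))

clip = torch.randn(1, 3, 9, 112, 112)  # batch of 1, RGB, 9 frames, 112x112
out = conv_video(clip)
print(out.shape[2])  # 5 temporal positions: 9 - 5 + 1
```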
> Also, aren't the predictions left-aligned for this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.
The predictions lose two frames on each side because of the receptive field. So, for example, you need to look at the 5th feature to see the output corresponding to the 5th-9th frames.
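A toy 1-D example of that receptive-field arithmetic (the single-channel conv below is a stand-in for the first layer, not the model's actual shapes):

```python
import torch
import torch.nn as nn

# Temporal conv with kernel size 5 and no padding.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5)

x = torch.randn(1, 1, 20)  # 20 input frames
y = conv(x)
print(y.shape[-1])  # 16 outputs: 20 - 5 + 1, so 4 frames are lost overall

# Output index t is computed from input frames t .. t+4, so feature index 4
# (the 5th feature) covers frames 4..8 (the 5th-9th frames); relative to the
# centre of each window, 2 frames of context are lost on each side.
```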
Changing the audio kernel size here messes up the dimensions of the model. How did you account for the context window size and M inside the model?
Where are the negative audio samples generated for the M-way matching problem? I only see the load_wav function sampling the audio corresponding to the starting index of the video frames; that is, I only see positive samples.