Hello, thanks for releasing the pytorch version of the code!
I have a couple of questions to sync this repo with the paper (sorry for the pun):
fc7 in the paper is a 256-d vector, whereas here the output feature is 1024-d (at least the pretrained model seems to be). Is this a newer/better version of the work, or am I looking in the wrong place?
In SyncNetInstance.py, line 107, there is a *4 applied to the sampling of the audio. I suspect it refers to some sort of stride, but I can't find the part of the paper that mentions it (perhaps it is too fundamental?). Could you explain what it is?
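To make my guess concrete: if the video runs at 25 fps and the audio features are computed at 100 steps per second, then each video frame would line up with 4 audio feature steps, which would explain the factor. Both rates are assumptions on my part, and the names below are hypothetical, not taken from the repo. A minimal sketch of the indexing I have in mind:

```python
# Assumed rates (my guess, not confirmed by the repo or the paper).
VIDEO_FPS = 25
AUDIO_FEATS_PER_SEC = 100

# Number of audio feature steps that span one video frame.
STEPS_PER_FRAME = AUDIO_FEATS_PER_SEC // VIDEO_FPS


def audio_window(vframe, n_video_frames=5):
    """Return (start, end) audio-feature indices aligned with a window
    of n_video_frames video frames starting at video frame vframe.

    Hypothetical helper for illustration only.
    """
    start = vframe * STEPS_PER_FRAME
    end = start + n_video_frames * STEPS_PER_FRAME
    return start, end


if __name__ == "__main__":
    for vframe in range(3):
        print(vframe, audio_window(vframe))
```

If this reading is right, the *4 is just the ratio between the two sampling rates rather than a stride inside the network.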