Hi,
I read the paper for this project and am now going through the code, and I have some doubts while trying to relate the code back to what is mentioned in the paper. For some context, I'm using the acapella dataset and the model weights trained on it.
The paper mentions that the input image size is (3, 48, 96, t_v), but in the code implementation, the image tensor that goes into the model has shape torch.Size([31, 15, 48, 96]).
My second and more important question: the input images/frames to the model are repeated win_size times (31 by default), while the mel spectrograms are taken for the 15 preceding and 15 succeeding frames. Am I interpreting that correctly? If so, what is the reason behind it, given that the spectrograms come from actual frames while the input frames are just repeats of a single frame? The relevant line is:
raw_sync_scores = model(lim_in[i].unsqueeze(0).repeat(win_size, 1, 1, 1).to(device), feat2p[i:i + win_size, :].to(device))
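For reference, here is a minimal sketch of how I currently read the shapes in that call (the clip length, the mel feature dimension of 80, and the exact layout of lim_in and feat2p are my assumptions from printing shapes, not taken from the paper):

import torch

# Minimal sketch of the windowing as I understand it (assumed shapes, not from the paper).
win_size = 31                                     # 15 preceding + current + 15 succeeding frames
num_frames = 200                                  # hypothetical clip length
lim_in = torch.randn(num_frames, 15, 48, 96)      # per-frame visual input as I observe it
feat2p = torch.randn(num_frames + win_size, 80)   # hypothetical mel features, one row per frame

i = 0
# The single visual frame i is repeated win_size times...
vis_batch = lim_in[i].unsqueeze(0).repeat(win_size, 1, 1, 1)  # -> torch.Size([31, 15, 48, 96])
# ...while the audio features are a sliding window over win_size different frames.
aud_batch = feat2p[i:i + win_size, :]                         # -> torch.Size([31, 80])

print(vis_batch.shape, aud_batch.shape)

So each visual window contains 31 copies of a single frame, while the audio window spans 31 distinct frames, which is exactly the part I find confusing.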