vskadandale / vocalist

Official repository for the paper VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Having some gaps in understanding how the code works #13

Open AmanSavaria1402 opened 1 year ago

AmanSavaria1402 commented 1 year ago

Hi, I went through the paper and am now going through the code for this project, and I have some doubts when trying to relate the code back to what is described in the paper. For context, I'm using the Acappella dataset and the model weights trained on it.

  1. The paper mentions that the visual input size is (3, 48, 96, T_v), but in the code implementation the tensor that actually goes into the model has shape `torch.Size([31, 15, 48, 96])`. How do these two shapes relate?
  2. The second and more important question: the input images/frames to the model are repeated `win_size` times (31 by default), while the mel spectrograms for the 15 preceding and succeeding frames are taken. Am I interpreting that correctly? If yes, what is the reason behind it, given that the spectrograms come from actual frames while the visual input is just the same chunk repeated? The relevant line is `raw_sync_scores = model(lim_in[i].unsqueeze(0).repeat(win_size, 1, 1, 1).to(device), feat2p[i:i + win_size, :].to(device))` (see the sketch after this list).
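
To make my reading of that loop concrete, here is a small standalone sketch of what I think is happening. The tensors and the model below are placeholders with made-up shapes, only meant to mirror the indexing and the repeat; `lim_in`, `feat2p`, `win_size`, and `device` are just the names from the snippet above:

```python
import torch

# Standalone sketch of how I read the windowed scoring loop. Everything here is a
# placeholder (random tensors, dummy model) so the indexing/repeat logic can be run
# and inspected without the real data or weights.
device = torch.device("cpu")
win_size = 31

num_chunks = 100
lim_in = torch.randn(num_chunks, 15, 48, 96)   # placeholder visual chunks: lim_in[i] is (15, 48, 96)
feat2p = torch.randn(num_chunks, 80, 16)       # placeholder per-index mel features (shape is made up)

def model(vid, aud):                           # placeholder standing in for the real VocaLiST model
    return torch.zeros(vid.shape[0])           # pretend: one raw sync score per element in the batch

for i in range(num_chunks - win_size + 1):
    # The SAME visual chunk is repeated win_size times along the batch dimension ...
    vid = lim_in[i].unsqueeze(0).repeat(win_size, 1, 1, 1).to(device)   # -> (31, 15, 48, 96)
    # ... while the audio side takes win_size DIFFERENT consecutive mel windows.
    aud = feat2p[i:i + win_size, :].to(device)                          # -> (31, 80, 16) here
    raw_sync_scores = model(vid, aud)                                   # (31,) scores, one per audio window
```

If that reading is right, each call scores one fixed lip chunk against 31 different audio windows, which is the part I don't understand.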