joonson / syncnet_python

Out of time: automated lip sync in the wild
MIT License
652 stars, 146 forks

Why are the confidence and distance for an original video low and high, respectively? #66

Open Himanshu21135 opened 6 months ago

Himanshu21135 commented 6 months ago

@joonson I have a question about the code in SyncNetInstance.py.

Screenshot 2024-04-08 132416

In the function calc_pdist, the window is there to account for the offset, right? As I understand it, torch.stack(dists,1) returns a tensor of shape (lastframe, window_size), and mdist is then computed from it. I am unable to follow the logic of mdist = torch.mean(torch.stack(dists,1),1): the mean is taken across the columns, which gives mdist the shape (1,31), i.e. simply a list of 31 values. Could you explain why the mean is taken across columns? From my understanding, it should be taken across rows, which would give a shape of (lastframe, 1), i.e. one mean per frame over the window.
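To make the shapes concrete, here is a minimal sketch of the two reductions being compared. It is not the repository's code. It assumes that dists is a list with one 1-D array per frame, each of length window_size (31 here, i.e. a hypothetical maximum shift of 15). NumPy stands in for PyTorch so the snippet runs without torch; np.stack and mean take the same axis argument as torch.stack and torch.mean. The names num_frames, vshift, stacked and per_frame are made up for illustration.

```python
import numpy as np

num_frames = 5              # hypothetical number of frames ("lastframe")
vshift = 15                 # assumed maximum shift in either direction
win_size = 2 * vshift + 1   # 31 candidate offsets per frame

rng = np.random.default_rng(0)
# Assumption: one 1-D array per frame, one distance per candidate offset.
dists = [rng.random(win_size) for _ in range(num_frames)]

# Stacking along axis 1 puts offsets on axis 0 and frames on axis 1.
stacked = np.stack(dists, 1)

mdist = stacked.mean(1)      # average over frames: one value per offset
per_frame = stacked.mean(0)  # average over offsets: one value per frame

print(stacked.shape)    # (31, 5)
print(mdist.shape)      # (31,)
print(per_frame.shape)  # (5,)
```

Under these assumptions the stacked array has shape (window_size, lastframe), not (lastframe, window_size), so reducing over axis 1 averages over frames and leaves one mean distance per candidate offset. Reducing over axis 0 instead would give one value per frame, which is the alternative described above.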

I also ran an experiment. I computed the distance and confidence for an original video that was not dubbed, and the distance came out quite high while the confidence was very low; I expected a low distance and a high confidence. I then used the Wav2Lip model to create a dubbed video of a speaker saying the same statement as in the original file and computed the distance and confidence again. The distance for the dubbed video is comparatively lower than the distance for the original video. What would be the reason for this?

Please share your views on why the mean is taken across columns rather than across rows.