joonson / syncnet_python

Out of time: automated lip sync in the wild
MIT License

Lip sync error using Syncnet #12

Closed by omcar17 3 years ago

omcar17 commented 5 years ago

Hello, thank you for the excellent work and the publicly available code. I am using SyncNet to check whether a video has a lip-sync error, but I am getting seemingly random values for AV offset and confidence. I am using the trained weights available on the official website. Can someone elaborate on this paragraph from the paper?

Determining the lip-sync error - To find the time offset between the audio and the video, we take a sliding-window approach. For each sample, the distance is computed between one 5-frame video feature and all audio features in the ± 1 second range. The correct offset is when this distance is at a minimum. However as Table 2 suggests, not all samples in a clip are discriminative (for example, there may be samples in which nothing is being said at that particular time), therefore multiple samples are taken for each clip, and then averaged.

I am missing something in this paragraph. How do I collect multiple samples for each clip? I would like to know how to obtain reliable values of the metrics (AV offset, confidence) that show whether the video and audio of a sample are out of sync.

Thank you

joonson commented 5 years ago

@omcar17 Would you be able to send the sample that you are having the problem with? Regarding this paragraph, we search for the corresponding audio frame not for just one video frame, but for multiple video frames. This is implemented in SyncNetInstance.py.
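For illustration, the sliding-window search from the quoted paragraph might be sketched roughly as below. This is not the code in SyncNetInstance.py: `estimate_offset` and `max_shift` are hypothetical names, the features are assumed to be precomputed arrays with one row per sample, and both the sign convention of the offset and the confidence definition (median of the averaged distances minus their minimum) are assumptions made for this sketch.

```python
import numpy as np


def estimate_offset(video_feats, audio_feats, max_shift=15):
    """Estimate the AV offset (in feature steps) by averaging distances over samples.

    video_feats, audio_feats: arrays of shape (num_samples, feat_dim).
    Returns (offset, min_dist, confidence).
    """
    n = min(len(video_feats), len(audio_feats))
    video_feats, audio_feats = video_feats[:n], audio_feats[:n]

    # Zero-pad the audio features so every video sample has a full search window.
    pad = np.zeros((max_shift, audio_feats.shape[1]))
    padded = np.concatenate([pad, audio_feats, pad], axis=0)

    # dists[i, k]: distance between video sample i and audio sample i + (k - max_shift).
    dists = np.stack([
        np.linalg.norm(padded[i:i + 2 * max_shift + 1] - video_feats[i], axis=1)
        for i in range(n)
    ])

    # Average over all samples so non-discriminative ones (e.g. silence) are smoothed out.
    mean_dist = dists.mean(axis=0)
    best = int(np.argmin(mean_dist))

    # Assumed sign convention: video sample i best matches audio sample i + offset.
    offset = best - max_shift
    min_dist = float(mean_dist[best])
    # Assumed confidence definition: how far the minimum sits below the median distance.
    confidence = float(np.median(mean_dist) - min_dist)
    return offset, min_dist, confidence


# Synthetic check: audio features are the video features delayed by 3 steps.
rng = np.random.default_rng(0)
video = rng.normal(size=(50, 8))
audio = np.roll(video, 3, axis=0)
offset, min_dist, confidence = estimate_offset(video, audio)
```

Averaging the per-sample distance curves before taking the minimum is what "multiple samples are taken for each clip, and then averaged" amounts to in this sketch: a single silent sample has a flat distance curve and contributes little to where the averaged minimum falls.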

omcar17 commented 5 years ago

Hello @joonson, thank you for your response. I have been testing this model for some time now. For false videos I get confidence values in the range 1 to 2.5, while for genuine videos the confidence values are in the range 3 to 8. For some videos I get no confidence value at all, because run_syncnet.py stops partway through without reporting an error. I am unable to understand the reason behind this problem. I am also getting better results for HD video than for webcam video. Does the confidence threshold depend on the quality of the video/audio? Here are the sample results.

Sample 1 - Genuine Video

Duration - 7 secs Type - MPEG-4 video (video/mp4)

AV offset: -1 Min dist: 7.613 Confidence: 7.609

Sample 2 - Genuine Video

Duration - 7 secs Type - MPEG-4 video (video/mp4)

run_syncnet.py stops after printing "Model data/syncnetl2.model loaded." and produces no results.

Sample 3 - Genuine Video

Duration - 10 secs Type - MPEG-4 video (video/mp4)

AV offset: -3 Min dist: 12.763 Confidence: 3.136

Sample 4 - False video

Duration - 10 secs Type - Windows Media video (video/x-ms-wmv)

AV offset: -3 Min dist: 12.402 Confidence: 1.038

Thank you.

joonson commented 5 years ago

Would you be able to send Sample 2, since I cannot reproduce the same problem?

omcar17 commented 5 years ago

video_001.zip This is the zip file of Sample 2. I am not able to get the lip-sync error for this video.

mdoulgerakis commented 3 years ago

Hey, the video is too short. It needs to be more than 100 frames long to work.