@omcar17 Would you be able to send the sample that you are having the problem with? Regarding this paragraph, we search for the corresponding audio frame not just for one video frame, but for multiple video frames. This is implemented in SyncNetInstance.py.
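In essence, each 5-frame video sample gives one distance curve over the candidate offsets; these curves are stacked and averaged, and the confidence reflects how far the minimum of the averaged curve stands out. A simplified sketch of that aggregation step (not the exact code in the repo; `dists` and `vshift` mirror the idea, not the actual variable names):

```python
import torch

def aggregate_offset(dists, vshift):
    """Average per-sample distance curves and pick the best offset.

    dists:  list of 1-D tensors, one curve per 5-frame video sample,
            each of length 2 * vshift + 1 (one entry per candidate offset)
    vshift: half-width of the offset search range, in video frames
    """
    # Average the distance curves over all samples in the clip.
    mean_dists = torch.mean(torch.stack(dists, dim=1), dim=1)
    min_dist, min_idx = torch.min(mean_dists, dim=0)
    offset = vshift - min_idx.item()   # estimated AV offset in frames
    # Confidence: how far the best offset stands out from a typical one.
    conf = (torch.median(mean_dists) - min_dist).item()
    return offset, min_dist.item(), conf
```

A low confidence therefore means that no offset is clearly better than the others, which is what you would expect for out-of-sync or non-speech clips.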
Hello @joonson, Thank you for your response. I have been testing this model for some time now. I am getting confidence values in the range of 1-2.5 for false videos, while for genuine videos the confidence values are in the range of 3-8. For some videos I get no confidence value at all, because run_syncnet.py stops midway with no error. I am unable to understand the reason behind this problem. I am also getting better results on HD video than on webcam video. Does the confidence threshold depend on the quality of the video/audio? Here are the sample results.
Sample 1 - Genuine Video
Duration - 7 secs Type - MPEG-4 video (video/mp4)
AV offset: -1 Min dist: 7.613 Confidence: 7.609
Sample 2 - Genuine Video
Duration - 7 secs Type - MPEG-4 video (video/mp4)
(run_syncnet.py stops after printing "Model data/syncnetl2.model loaded.") No results
Sample 3 - Genuine Video
Duration - 10 secs Type - MPEG-4 video (video/mp4)
AV offset: -3 Min dist: 12.763 Confidence: 3.136
Sample 4 - False video
Duration - 10 secs Type - Windows Media video (video/x-ms-wmv)
AV offset: -3 Min dist: 12.402 Confidence: 1.038
Thank you.
Would you be able to send Sample 2, since I cannot reproduce the same problem?
video_001.zip This is the zip file of Sample 2. I am not able to find the lip-sync error for this video.
Hey, the video is too short. It needs a video longer than 100 frames to work.
Hello, Thank you for the excellent work and the publicly available code. I am using SyncNet to find whether there is a lip-sync error in a video. I am getting very random values of AV offset and confidence. I am using the trained weights available on the official website. Can someone elaborate on this paragraph from the paper? -
Determining the lip-sync error - To find the time offset between the audio and the video, we take a sliding-window approach. For each sample, the distance is computed between one 5-frame video feature and all audio features in the ± 1 second range. The correct offset is when this distance is at a minimum. However as Table 2 suggests, not all samples in a clip are discriminative (for example, there may be samples in which nothing is being said at that particular time), therefore multiple samples are taken for each clip, and then averaged.
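For concreteness, here is my rough reading of that procedure as code (a sketch only; `vfeats`/`afeats` are made-up names for the per-sample video and audio embeddings, and at 25 fps the ±1 second range corresponds to ±25 frames):

```python
import torch
import torch.nn.functional as F

def sliding_window_offset(vfeats, afeats, vshift=25):
    """My reading of the paper's sliding-window search (sketch only).

    vfeats: (N, D) embeddings of 5-frame video windows
    afeats: (N, D) audio embeddings, aligned index-for-index
    vshift: search range in frames; 25 frames ~ 1 second at 25 fps
    """
    # Pad the audio so every video sample can be compared at every offset.
    afeats_pad = F.pad(afeats, (0, 0, vshift, vshift))
    dists = []
    for i in range(len(vfeats)):
        # Distance between one video sample and the audio at all offsets
        # in the +-vshift range (the "sliding window").
        d = torch.norm(afeats_pad[i:i + 2 * vshift + 1] - vfeats[i], dim=1)
        dists.append(d)
    # Not every sample is discriminative, so average over all samples
    # before taking the minimum.
    mean_dists = torch.mean(torch.stack(dists, dim=1), dim=1)
    offset = vshift - torch.argmin(mean_dists).item()  # AV offset in frames
    return offset
```

Is this roughly what SyncNetInstance.py does?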
I am missing something in this paragraph. How do I collect multiple samples for each clip? I would like to know how to get proper values of the metrics (AV offset, confidence) that show whether the audio and video of a sample are out of sync.
Thank you