Perfect Match evaluation protocol: the task is to determine the correct synchronisation within a ±15 video-frame window, and the synchronisation is counted as correct if the predicted offset is within ±1 video frame of the ground truth. Since 3 of the 31 candidate offsets fall within that tolerance, a random prediction would yield 3/31 ≈ 9.7% accuracy.
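To make sure I understand the protocol, here is a minimal sketch of it (my own illustration, not the authors' code; `is_correct` and the constants are my names):

```python
# Minimal sketch of the evaluation protocol described above (not the
# authors' code). Offsets are searched in a +/-15 video-frame window,
# and a prediction counts as correct if it lies within +/-1 frame of
# the ground truth.
WINDOW = 15     # candidate offsets span [-15, +15], i.e. 31 candidates
TOLERANCE = 1   # a prediction is correct within +/-1 frame

def is_correct(predicted_offset: int, true_offset: int) -> bool:
    return abs(predicted_offset - true_offset) <= TOLERANCE

# Random baseline: 3 of the 31 candidate offsets fall within the
# tolerance, so chance accuracy is 3/31.
num_candidates = 2 * WINDOW + 1      # 31
num_acceptable = 2 * TOLERANCE + 1   # 3
random_accuracy = num_acceptable / num_candidates
print(f"{random_accuracy:.1%}")      # 9.7%
```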
How does changing M affect the model?
The training objective is a 46-way classification. How exactly do you go from that 46-way classification to predicting an offset within the ±15 frame (31-way) window?
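For context on what I mean: a common SyncNet-style way to turn per-frame audio-visual scores into an offset prediction is to score every candidate offset in the window and pick the best one. A hypothetical sketch (synthetic `scores`, not the paper's implementation):

```python
import numpy as np

# Hypothetical sketch: given a (clip_length x 31) matrix of audio-visual
# similarity scores, one column per candidate offset in [-15, +15],
# average the scores over the clip and pick the offset whose mean score
# is highest.
def predict_offset(scores: np.ndarray, window: int = 15) -> int:
    mean_scores = scores.mean(axis=0)            # average over time
    return int(np.argmax(mean_scores)) - window  # column index -> offset

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 31))   # synthetic scores for illustration
scores[:, 15 + 2] += 5.0              # plant a clear peak at offset +2
print(predict_offset(scores))         # 2
```

Is something along these lines how you map the classification output to an offset, or does the 46-way setup work differently?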
Do you have the class split for your evaluation data? Aren't all the test samples in sync? Where do you get the out-of-sync ground-truth frames from?
The N-way classification accuracy reported here is 49%, but your numbers are much higher. I'm wondering why there is such a large discrepancy between the two.
The visual stream uses the whole face region rather than just mouth crops. Is that correct?
Hello,
could you provide the code for the paper "Perfect match: Improved cross-modal embeddings for
audio-visual synchronisation"?
Thank you!
I can provide my e-mail address if needed.
Hello,
I have a couple of questions regarding the 75.8% synchronization accuracy reported in https://ieeexplore.ieee.org/abstract/document/9067055/
Thank you!