joonson / syncnet_trainer

Disentangled Speech Embeddings using Cross-Modal Self-Supervision
MIT License

Evaluation Protocol for synchronization accuracy in Perfect Match Paper #10


ak-7 commented 3 years ago

Hello,

I have a few questions regarding the 75.8% synchronisation accuracy reported in https://ieeexplore.ieee.org/abstract/document/9067055/

Perfect Match evaluation protocol: The task is to determine the correct synchronisation within a ±15 frame window, and the synchronisation is determined to be correct if the predicted offset is within 1 video frame of the ground truth. A random prediction would therefore yield 9.7% accuracy.
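
If I understand the protocol correctly, the 9.7% figure is just the window arithmetic: 31 candidate offsets in [-15, +15], of which the 3 nearest the ground truth count as correct. A quick check:

```python
# Random-baseline accuracy for the ±15-frame sync evaluation.
num_offsets = 2 * 15 + 1          # candidate offsets in [-15, +15]
num_correct = 3                   # gt-1, gt, gt+1 all count as correct
print(num_correct / num_offsets)  # ≈ 0.0968, i.e. the quoted 9.7%
```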

  1. How does changing M affect the model?
  2. Training is a 46-way classification. How exactly do you go from the 46-way classification to predicting an offset in the ±15-frame window at test time? (My guess is sketched below this list.)
  3. Do you have the class split for your evaluation data? Aren't all the test samples in sync? Where do you get out-of-sync ground-truth frames from?
  4. The accuracy for N-way classification reported here is 49%, but your numbers are much higher. I'm wondering why there is such a large discrepancy between the two numbers.
  5. The visual stream uses whole-face pixels and not just mouth crops. Is that correct?
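
Regarding question 2, my current understanding (a minimal sketch, not your actual code; the `vid_feat`/`aud_feat` names and the cosine-similarity scoring are my assumptions) is that the trained two-stream model is applied at test time by scoring every candidate offset in the ±15 window and taking the argmax:

```python
import torch
import torch.nn.functional as F

def predict_offset(vid_feat, aud_feat, max_offset=15):
    """Predict the audio-video offset by exhaustive search over the window.

    vid_feat, aud_feat: (T, D) per-frame embeddings from the two streams.
    Returns the offset in [-max_offset, max_offset] with the highest
    mean cosine similarity over the overlapping frames.
    """
    T = vid_feat.size(0)
    scores = []
    for off in range(-max_offset, max_offset + 1):
        # Overlapping frame ranges when the audio stream is shifted by `off`.
        v = vid_feat[max(0, -off): T - max(0, off)]
        a = aud_feat[max(0, off): T - max(0, -off)]
        scores.append(F.cosine_similarity(v, a, dim=1).mean())
    return int(torch.stack(scores).argmax()) - max_offset

# A prediction counts as correct if |predicted - ground_truth| <= 1.
```

Is that roughly what you do, or is the offset read out from the classification head directly?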

Thank you!

6eternal6 commented 6 months ago

Hello, could you provide the code for the paper "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation"? Thank you! I can provide my e-mail.