joonson / syncnet_trainer

Disentangled Speech Embeddings using Cross-Modal Self-Supervision
MIT License

Evaluation Protocol for synchronization accuracy in Perfect Match Paper #10


ak-7 commented 3 years ago

Hello,

I have a few questions regarding the 75.8% synchronisation accuracy reported in https://ieeexplore.ieee.org/abstract/document/9067055/

Perfect Match evaluation protocol: The task is to determine the correct synchronisation within a ±15 frame window, and the synchronisation is determined to be correct if the predicted offset is within 1 video frame of the ground truth. A random prediction would therefore yield 9.7% accuracy.
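
If I understand the protocol correctly, the 9.7% figure is just the window arithmetic: 31 candidate offsets in [-15, +15], of which the 3 nearest the ground truth count as correct. A quick check:

```python
# Random-baseline accuracy for the ±15-frame sync evaluation.
num_offsets = 2 * 15 + 1          # candidate offsets in [-15, +15]
num_correct = 3                   # gt-1, gt, gt+1 all count as correct
print(num_correct / num_offsets)  # ≈ 0.0968, i.e. the quoted 9.7%
```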

  1. How does changing M affect the model?
  2. Training is a 46-way classification. How exactly do you go from the 46-way classification to predicting an offset in the ±15-frame window at test time? (My guess is sketched below this list.)
  3. Do you have the class split for your evaluation data? Aren't all the test samples in sync? Where do you get out-of-sync ground-truth frames from?
  4. The accuracy for N-way classification reported here is 49%, but your numbers are much higher. I'm wondering why there is such a large discrepancy between the two numbers.
  5. The visual stream uses whole-face pixels and not just mouth crops. Is that correct?
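
Regarding question 2, my current understanding (a minimal sketch, not your actual code; the `vid_feat`/`aud_feat` names and the cosine-similarity scoring are my assumptions) is that the trained two-stream model is applied at test time by scoring every candidate offset in the ±15 window and taking the argmax:

```python
import torch
import torch.nn.functional as F

def predict_offset(vid_feat, aud_feat, max_offset=15):
    """Predict the audio-video offset by exhaustive search over the window.

    vid_feat, aud_feat: (T, D) per-frame embeddings from the two streams.
    Returns the offset in [-max_offset, max_offset] with the highest
    mean cosine similarity over the overlapping frames.
    """
    T = vid_feat.size(0)
    scores = []
    for off in range(-max_offset, max_offset + 1):
        # Overlapping frame ranges when the audio stream is shifted by `off`.
        v = vid_feat[max(0, -off): T - max(0, off)]
        a = aud_feat[max(0, off): T - max(0, -off)]
        scores.append(F.cosine_similarity(v, a, dim=1).mean())
    return int(torch.stack(scores).argmax()) - max_offset

# A prediction counts as correct if |predicted - ground_truth| <= 1.
```

Is that roughly what you do, or is the offset read out from the classification head directly?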

Thank you!

6eternal6 commented 6 months ago

Hello, could you provide the code for the paper "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation"? Thank you! I can provide my e-mail.