joonson / syncnet_trainer

Disentangled Speech Embeddings using Cross-Modal Self-Supervision
MIT License
150 stars, 26 forks

Evaluation on list save #9

Open annadodson787 opened 3 years ago

annadodson787 commented 3 years ago

Hi, I am wondering what the reasoning is behind the evaluation implemented in evaluateFromListSave. As far as I can tell, it loads two audio files, runs the audio feature extractor on each, and computes the feature-wise cosine distance between them. Where does the video pipeline come in? How is this a good evaluation metric if it does not use the visual stream?

joonson commented 3 years ago

This part of the pipeline evaluates the quality of the audio embeddings on the downstream task of speaker recognition, which is why it uses only the audio stream. The audio-visual evaluation can be done using the validation script.
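
For readers following along, here is a minimal sketch of what an audio-only speaker-recognition evaluation over a list of trial pairs could look like. It is not the code in evaluateFromListSave. The helper names (cosine_score, equal_error_rate), the toy embeddings, and the choice of equal error rate as the summary metric are all assumptions made for illustration. In practice the embeddings would come from the trained audio feature extractor.

```python
import numpy as np


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))


def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds and take the point where the
    false-accept and false-reject rates are closest (hypothetical helper).
    labels: 1 = same speaker, 0 = different speakers."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):  # np.unique returns sorted values
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # impostor pairs accepted
        frr = np.mean(~accept[labels == 1])  # genuine pairs rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


# Toy stand-ins for audio embeddings; real ones would be produced by the
# audio feature extractor for each file named in the trial list.
emb = {
    "a1": np.array([1.0, 0.1, 0.0]),
    "a2": np.array([0.9, 0.0, 0.1]),
    "b1": np.array([0.0, 1.0, 0.1]),
    "b2": np.array([0.1, 0.9, 0.0]),
}

# Each trial: (label, first utterance, second utterance).
trials = [(1, "a1", "a2"), (1, "b1", "b2"), (0, "a1", "b1"), (0, "a2", "b2")]

labels = [label for label, _, _ in trials]
scores = [cosine_score(emb[x], emb[y]) for _, x, y in trials]
eer = equal_error_rate(scores, labels)
print(f"EER = {eer:.2f}")
```

No video is involved at any point: if the audio embeddings separate speakers well, same-speaker pairs score higher than different-speaker pairs and the error rate is low.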