HideOn1Bush opened this issue 1 year ago
Hi, the active speaker detection (ASD) procedure is executed after the proper realignment of the videos. Running the ASD procedure on a misaligned video would imply having different people selected as the active speaker in different frames of the same video. Also, doing the whole process in this order was more convenient, since nearly all utterances in MELD (more than 99%) are spoken by a single person. This way, we could select the person who has been consistently identified as the active speaker throughout the video.
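To make that selection step concrete, here is a minimal sketch (not the repository's actual code; the per-frame data structure and function name are illustrative) of picking the face track most consistently flagged as active across the frames of an utterance:

```python
from collections import Counter

def select_consistent_speaker(frame_detections):
    """Pick the face track most often flagged as active across frames.

    frame_detections: one entry per frame, each a dict mapping face-track
    IDs to an "is active" flag (hypothetical layout, for illustration only).
    """
    votes = Counter(
        track_id
        for frame in frame_detections
        for track_id, is_active in frame.items()
        if is_active
    )
    if not votes:
        return None  # no frame had an active speaker
    # The track flagged as active in the most frames is taken as the
    # single speaker of the utterance.
    return votes.most_common(1)[0][0]

# Example: track 3 is flagged as active in 2 of 3 frames.
frames = [{3: True, 7: False}, {3: True, 7: False}, {3: False, 7: True}]
print(select_consistent_speaker(frames))  # -> 3
```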
The extracted face crops of the active speaker in the realigned version of dia0_utt12 would be:
[face crop images]
and those of the active speaker in the realigned version of dia0_utt13 would be:
[face crop images]
Thanks for the example you gave. However, when I followed the steps to generate all CSV files and face sequences, dia0_utt13 was missing from MELD/realigned/train/faces/0000, as shown in the figure below.
I checked MELD_all_faces_bboxes_and_tracks.csv, which correctly contains the face sequence for dia0_utt13, as shown in the image below.
However, the information for dia0_utt13 is missing from MELD_active_speaker_face_bboxes.csv, as shown in the figure below.
This is not limited to dia0 of the training set; it also happens in dia1 (only utt0, 1, 2, 3, and 6 are present; utt4 and utt5 are missing), as shown in the figure below.
These entries seem to go missing when running python3 -m MELD-FAIR.asd.active_speaker_detection. How can I solve this problem and obtain a MELD_active_speaker_face_bboxes.csv file with no missing information? I look forward to your reply.
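For reference, one way to list every utterance that has face tracks but no active-speaker entry would be a quick pandas diff between the two CSV files (a sketch; the shared identifier column name "video" below is an assumption and should be adjusted to the actual column):

```python
import pandas as pd

# "video" is an assumed name for the utterance identifier column; adjust it
# to whatever identifier column the two CSV files actually share.
all_faces = pd.read_csv("MELD_all_faces_bboxes_and_tracks.csv")
active = pd.read_csv("MELD_active_speaker_face_bboxes.csv")

missing = sorted(set(all_faces["video"]) - set(active["video"]))
print(f"{len(missing)} utterances have face tracks but no active-speaker entry:")
print(missing)
```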
I apologize for the delayed response; it took some time to thoroughly analyze the underlying issues. Upon investigation, I found that the CSV data, accessible at MELD_active_speaker_face_bboxes.csv, was generated through a complex process involving four distinct active speaker detection models (among them two implementations of TalkNet).
The rationale for employing this intricate approach lies in the varying strengths of these models: both implementations of TalkNet excel in longer videos, while the other models demonstrate superior performance in shorter videos. The resulting scores were combined in a priority order, favoring the predictions of the models listed earlier, with an emphasis on ensuring that at least one person is tagged as an active speaker, in keeping with MELD's requirement that each video should feature at least one active speaker (with most containing just one).
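As a rough illustration of that priority scheme (a minimal sketch, not the actual pipeline; the per-track score layout, threshold, and function name are assumptions):

```python
def fuse_predictions(per_model_scores, threshold=0.0):
    """Priority-ordered fusion of per-track activeness scores.

    per_model_scores is ordered by model priority; each element maps a
    face-track ID to a score for the utterance (hypothetical layout).
    """
    for scores in per_model_scores:
        active = [track for track, score in scores.items() if score > threshold]
        if active:
            # The first (highest-priority) model that flags anyone wins.
            return active
    # No model flagged anyone: fall back to the single most likely track,
    # so that every video ends up with at least one active speaker.
    first = per_model_scores[0]
    return [max(first, key=first.get)]

# Example with two models: model A flags nobody, model B flags track 5.
print(fuse_predictions([{2: -0.4, 5: -0.1}, {2: -0.3, 5: 0.8}]))  # -> [5]
```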
However, given the availability of more advanced and publicly accessible active speaker detection models since the publication of MELD, it seems unnecessary to retain this intricate procedure. Including it would complicate the understanding of the methodology presented in the paper without significant benefits. I recommend either utilizing the existing data in MELD_active_speaker_face_bboxes.csv, which already addresses many mismatch cases in MELD, or incorporating more recent and higher-performing active speaker detection models in place of TalkNet.
Hello, thank you for your contribution. Could you please provide some examples from MELD-FAIR, such as the active speakers' face crops extracted from dia0_utt12 and dia0_utt13 in the original training set?