HideOn1Bush opened this issue 1 year ago
Hi, the active speaker detection (ASD) procedure is executed after the proper realignment of the videos. Running the ASD procedure on a misaligned video would imply having different people selected as the active speaker in different frames of the same video. Also, doing the whole process in this order was more convenient, since nearly all utterances in MELD (more than 99%) are spoken by a single person. This way, we could select the person who has been consistently identified as the active speaker throughout the video.
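To make that selection step concrete, here is a minimal sketch (not the repository's actual code; the per-frame data structure and function name are illustrative) of picking the face track most consistently flagged as active across the frames of an utterance:

```python
from collections import Counter

def select_consistent_speaker(frame_detections):
    """Pick the face track most often flagged as active across frames.

    frame_detections: one entry per frame, each a dict mapping face-track
    IDs to an "is active" flag (hypothetical layout, for illustration only).
    """
    votes = Counter(
        track_id
        for frame in frame_detections
        for track_id, is_active in frame.items()
        if is_active
    )
    if not votes:
        return None  # no frame had an active speaker
    # The track flagged as active in the most frames is taken as the
    # single speaker of the utterance.
    return votes.most_common(1)[0][0]

# Example: track 3 is flagged as active in 2 of 3 frames.
frames = [{3: True, 7: False}, {3: True, 7: False}, {3: False, 7: True}]
print(select_consistent_speaker(frames))  # -> 3
```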
The extracted face crops of the active speaker in the realigned version of dia0_utt12 would be:
[face crop images]
and those of the active speaker in the realigned version of dia0_utt13 would be:
[face crop images]
Thanks for the example you gave. However, when I followed the steps to generate all CSV files and face sequences, dia0_utt13 was missing from MELD/realigned/train/faces/0000, as shown in the figure below.
I checked MELD_all_faces_bboxes_and_tracks.csv, which correctly contains the face sequence for dia0_utt13, as shown in the image below.
However, the information for dia0_utt13 is missing from MELD_active_speaker_face_bboxes.csv, as shown in the figure below.
This is not limited to dia0 of the training set; it also happens in dia1 (only utt0, 1, 2, 3, and 6 are present; utt4 and utt5 are missing), as shown in the figure below.
These entries seem to go missing when running python3 -m MELD-FAIR.asd.active_speaker_detection. How can I solve this problem and obtain a MELD_active_speaker_face_bboxes.csv file with no missing information? I look forward to your reply.
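For reference, one way to list every utterance that has face tracks but no active-speaker entry would be a quick pandas diff between the two CSV files (a sketch; the shared identifier column name "video" below is an assumption and should be adjusted to the actual column):

```python
import pandas as pd

# "video" is an assumed name for the utterance identifier column; adjust it
# to whatever identifier column the two CSV files actually share.
all_faces = pd.read_csv("MELD_all_faces_bboxes_and_tracks.csv")
active = pd.read_csv("MELD_active_speaker_face_bboxes.csv")

missing = sorted(set(all_faces["video"]) - set(active["video"]))
print(f"{len(missing)} utterances have face tracks but no active-speaker entry:")
print(missing)
```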
I apologize for the delayed response; it took some time to thoroughly analyze the underlying issues. Upon investigation, I found that the CSV data, accessible at MELD_active_speaker_face_bboxes.csv, was generated through a complex process involving four distinct active speaker detection models (among them two implementations of TalkNet).
The rationale for employing this intricate approach lies in the varying strengths of these models: both implementations of TalkNet excel in longer videos, while the other models demonstrate superior performance in shorter videos. The resulting scores were combined in a priority order, favoring the predictions of the models listed earlier, with an emphasis on ensuring that at least one person is tagged as an active speaker, in keeping with MELD's requirement that each video should feature at least one active speaker (with most containing just one).
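As a rough illustration of that priority scheme (a minimal sketch, not the actual pipeline; the per-track score layout, threshold, and function name are assumptions):

```python
def fuse_predictions(per_model_scores, threshold=0.0):
    """Priority-ordered fusion of per-track activeness scores.

    per_model_scores is ordered by model priority; each element maps a
    face-track ID to a score for the utterance (hypothetical layout).
    """
    for scores in per_model_scores:
        active = [track for track, score in scores.items() if score > threshold]
        if active:
            # The first (highest-priority) model that flags anyone wins.
            return active
    # No model flagged anyone: fall back to the single most likely track,
    # so that every video ends up with at least one active speaker.
    first = per_model_scores[0]
    return [max(first, key=first.get)]

# Example with two models: model A flags nobody, model B flags track 5.
print(fuse_predictions([{2: -0.4, 5: -0.1}, {2: -0.3, 5: 0.8}]))  # -> [5]
```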
However, given the availability of more advanced and publicly accessible active speaker detection models since the publication of MELD, it seems unnecessary to retain this intricate procedure. Including it would complicate the understanding of the methodology presented in the paper without significant benefits. I recommend either utilizing the existing data in MELD_active_speaker_face_bboxes.csv, which already addresses many mismatch cases in MELD, or incorporating more recent and higher-performing active speaker detection models in place of TalkNet.
Hello, thank you for your contribution. Could you please provide some examples from MELD-FAIR, such as the active speakers' face crops extracted from dia0_utt12 and dia0_utt13 in the original training set?