katerynaCh / MMA-DFER

This repository provides the code for MMA-DFER, a multimodal (audiovisual) emotion recognition method. This is the official implementation of the paper "MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild".

MAFW dataset results are not achievable #5

Open qq1332427275 opened 3 months ago

qq1332427275 commented 3 months ago

[attachment: 1719969955264 (training results screenshot)]

Hello author! When I tried to reproduce the results on the MAFW dataset, I first extracted the frames of each video with ffmpeg (each video yields one more frame than the annotation count you provided, but the frames themselves look fine on inspection), and also extracted the audio with ffmpeg into .mp3 files. However, training the model on this dataset does not reach the reported accuracy (aka WAR); the attached plot shows my training results, and the test-set loss oscillates back and forth. I suspect the dataset was processed incorrectly at some step? (I have no problem reproducing the results on the DFER dataset after processing the data the same way.) Could you please provide me with your results for the MAFW dataset? Thank you very much for your reply!
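For reference, the preprocessing described above can be sketched as follows. The output patterns and ffmpeg flags are assumptions for illustration, not the exact commands used in this issue:

```python
import subprocess

def build_frame_cmd(video_path, out_dir):
    # Assemble an ffmpeg command that dumps every frame of a video
    # as zero-padded numbered PNG images (hypothetical output pattern).
    return ["ffmpeg", "-i", video_path, f"{out_dir}/%05d.png"]

def build_audio_cmd(video_path, out_path):
    # Assemble an ffmpeg command that strips the video stream (-vn)
    # and writes the audio track to an .mp3 file, as described above.
    return ["ffmpeg", "-i", video_path, "-vn", out_path]

# Example invocation (requires ffmpeg on PATH and the input file to exist):
# subprocess.run(build_frame_cmd("00025.mp4", "frames/00025"), check=True)
# subprocess.run(build_audio_cmd("00025.mp4", "audio/00025.mp3"), check=True)
```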

qq1332427275 commented 3 months ago

I found that some of the cropped frames have unaligned faces, i.e. the cropped image contains not only the face but also part of the background. I used the extract_faces_mfaw.py you provided and still have this problem, e.g. frames 00042/00043/00008/00007 of video 00025 contain not only faces but also background. Is there a problem here? Do you see this in your data as well?
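A quick way to catch the off-by-one between extracted frames and the provided annotation counts mentioned earlier is a per-video comparison. This is a minimal sketch, assuming frames live in one folder per video id and are saved as .png; the layout is hypothetical:

```python
from pathlib import Path

def frame_count_report(frames_root, expected_counts):
    # expected_counts maps video id (e.g. "00025") to the annotated
    # frame count. Returns {video_id: (extracted, expected)} for
    # videos where the two numbers disagree.
    mismatches = {}
    for video_id, expected in expected_counts.items():
        extracted = len(list(Path(frames_root, video_id).glob("*.png")))
        if extracted != expected:
            mismatches[video_id] = (extracted, expected)
    return mismatches
```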

katerynaCh commented 2 months ago

Hi!

  1. I have uploaded pre-trained MAFW models; can you try evaluating with them and see if you are still getting a low WAR? If you do, it is likely a preprocessing issue and I can dig into it further.
  2. Which fold is the plot from, and what UAR/WAR are you getting? Note that in MAFW the folds have higher variance in performance, and the 1st fold in particular is noticeably lower than the others.
  3. For video 0025 I have checked that I also have not only faces but background on some frames, so although this is not optimal for the model, it is consistent with my data. Specifically, I have a close-up of the face on about half of the frames, and background included (frame cropped at the person's chest, with the face roughly in the upper-right corner) on the rest.
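For anyone comparing numbers from point 2: WAR is the weighted average recall (i.e. overall accuracy) and UAR is the unweighted average recall (mean of per-class recalls). A minimal sketch of the standard definitions, not the repository's own evaluation code:

```python
from collections import defaultdict

def war_uar(y_true, y_pred):
    # WAR (weighted average recall) = overall accuracy.
    # UAR (unweighted average recall) = mean of per-class recalls,
    # so rare classes count as much as frequent ones.
    war = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][1] += 1
        if t == p:
            per_class[t][0] += 1
    uar = sum(c / n for c, n in per_class.values()) / len(per_class)
    return war, uar
```

With an imbalanced label set the two diverge, which is why papers on DFEW/MAFW report both.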
katerynaCh commented 2 months ago

Hi! @qq1332427275 were you able to achieve the desired results with the pre-trained models?