marcogdepinto / emotion-classification-from-audio-files

Understanding emotions from audio files using neural networks and multiple datasets.
GNU General Public License v3.0

Duplicate data #6

Closed vpodpecan closed 5 years ago

vpodpecan commented 5 years ago

There is a major issue regarding the data. The audio track in the video files is the same as the audio-only data. This means you have duplicates in your data, which results in an inflated 92% accuracy.

marcogdepinto commented 5 years ago

Could you please mention where you found this in the official documentation of the data?

From what I saw (but I could be wrong), the videos are different, and also recorded in two different formats.

In fact, quoting the docs: "Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contains 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880."

In the audio files the modality is not mentioned.

Have you written to the data owners and received an official response you can share about this? Please provide data to justify your claim. In the meantime, I am closing the issue.

vpodpecan commented 5 years ago

Quoting the source page https://zenodo.org/record/1188976:

All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound).

marcogdepinto commented 5 years ago

Thanks for your answer. The point you mentioned does not say that the audio in the recordings and the audio in the videos are the same. It says that there are 3 modalities, but the audio-only and audio-video versions could have been recorded at different times. Do you have a source which specifies that the audio-only files contain the same audio as the audio-video files?

vpodpecan commented 5 years ago

If you do not believe this is the case, listen for example to the following files, or open them in an audio editing program: 03-02-01-01-01-01-01.wav and 01-02-01-01-01-01-01.mp4. They are in different modalities (03 = audio-only, 01 = full-AV) but have the same audio content.
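The duplication claim above can also be checked programmatically. Below is a minimal sketch that compares two mono signals via peak normalized cross-correlation; it uses synthetic arrays as stand-ins for the decoded .wav and .mp4 audio (in practice you would load both tracks first, e.g. with an audio library, and resample them to a common rate):

```python
import numpy as np

def max_normalized_xcorr(a, b):
    """Peak of the normalized cross-correlation between two mono signals.

    Values near 1.0 indicate the same underlying content (possibly
    time-shifted); independent recordings score near 0.
    """
    n = min(len(a), len(b))
    a = a[:n] - a[:n].mean()
    b = b[:n] - b[:n].mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    # "full" mode tolerates a small alignment offset between the two files.
    corr = np.correlate(a, b, mode="full")
    return float(np.max(np.abs(corr)) / denom)

# Synthetic stand-ins for the decoded audio of the .wav and .mp4 files.
rng = np.random.default_rng(0)
x = rng.standard_normal(48000)        # one second at 48 kHz
duplicate = x.copy()
unrelated = rng.standard_normal(48000)

print(max_normalized_xcorr(x, duplicate))  # 1.0 -> same content
print(max_normalized_xcorr(x, unrelated))  # near 0 -> different content
```

A score close to 1.0 for the two files named above would confirm the duplication without relying on listening tests.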

marcogdepinto commented 5 years ago

Thanks for pointing this out: it is an issue that can be easily solved by adding noise to a set of samples (instead of 2 identical samples, we can use one with added noise and one without). To be honest, I am not working on this project at the moment, but if you want to dive deeper, you can try adding noise to the samples, retraining the model, and submitting a pull request to enrich the project :)

weimiao86 commented 5 years ago

@marcogdepinto Nice job, I learned a lot from this project, thanks! And I agree with @vpodpecan that the audio in the .wav and .mp4 files is the same. I used the audio-only files and got 59.45% accuracy. I have another concern: should we split the training and testing data by actor? If we use audio from the same actor in both the training and the testing data, we may get an inflated accuracy. My suggestion is to hold out a few actors' audio files for testing (for example, 2 males and 2 females) and use the rest for training; we can then compute the average accuracy by repeating this over different held-out actors.
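The actor-based split proposed above can be sketched directly from the RAVDESS filename convention, whose last hyphen-separated field is the actor ID (e.g. `03-01-01-01-01-01-01.wav` is actor 01). A minimal helper, written here for illustration only:

```python
def actor_split(files, test_actors):
    """Split RAVDESS filenames into train/test lists by actor ID.

    RAVDESS filenames have 7 hyphen-separated fields; the last one
    is the actor number, e.g. "03-01-01-01-01-01-01.wav" -> actor 1.
    `test_actors` is a set of integer actor IDs to hold out.
    """
    def actor_of(name):
        return int(name.rsplit(".", 1)[0].split("-")[-1])

    train = [f for f in files if actor_of(f) not in test_actors]
    test = [f for f in files if actor_of(f) in test_actors]
    return train, test

files = [
    "03-01-01-01-01-01-01.wav",  # actor 1
    "03-01-06-02-02-01-12.wav",  # actor 12
    "03-01-03-01-01-01-02.wav",  # actor 2
]
train_files, test_files = actor_split(files, test_actors={1, 2})
print(train_files)  # ['03-01-06-02-02-01-12.wav']
print(test_files)   # ['03-01-01-01-01-01-01.wav', '03-01-03-01-01-01-02.wav']
```

Iterating this over disjoint groups of held-out actors (as the comment suggests) gives a leave-actors-out cross-validation; scikit-learn's `GroupKFold` with the actor ID as the group label does the same thing.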

marcogdepinto commented 5 years ago

@weimiao86: thanks for your comment! If you are interested in enriching the project, just open a new branch and/or submit a pull request to contribute!

marcogdepinto commented 4 years ago

Hi, I recently picked this project up again as I am writing a paper on it, so I have an answer to this issue. The FFMPEG call that extracts additional data from the video (https://github.com/marcogdepinto/Emotion-Classification-Ravdess/blob/master/Mp4ToWav.py) sets a frequency of 44100 Hz (the ffmpeg `-ar` option; more on https://ffmpeg.org/ffmpeg-all.html), which corresponds to 44.1 kHz. The original frequency of the audio files in the dataset is 48 kHz (source: https://zenodo.org/record/1188976#.XYendZMzZN1), so the features created by librosa's MFCC should be different. When I have some time, I will extract the features of the files 03-01-01-01-01-01-01.wav and 01-01-01-01-01-01-01.wav and compare the generated arrays to confirm this.
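For reference, the extraction step being discussed boils down to an ffmpeg invocation like the one below (a sketch, not the repo's exact command; the input filename is taken from the example earlier in the thread). `-vn` drops the video stream, `-acodec pcm_s16le` writes 16-bit PCM, and `-ar` sets the output sample rate, which is where 44100 vs. the dataset's native 48000 matters:

```shell
# Resampled extraction (what the script does): audio comes out at 44.1 kHz.
ffmpeg -i 01-02-01-01-01-01-01.mp4 -vn -acodec pcm_s16le -ar 44100 out_44k.wav

# Native-rate extraction: keeps the dataset's original 48 kHz.
ffmpeg -i 01-02-01-01-01-01-01.mp4 -vn -acodec pcm_s16le -ar 48000 out_48k.wav
```

Note that resampling changes the MFCC arrays numerically but the underlying speech content is still the same, so this alone does not remove the duplication concern raised at the top of the issue.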