marcogdepinto / emotion-classification-from-audio-files

Understanding emotions from audio files using neural networks and multiple datasets.
GNU General Public License v3.0

Overfitting? #11

Closed · alezenonos closed this issue 4 years ago

alezenonos commented 4 years ago

This is an inspiring piece of work, and thank you for keeping it open source. I was just wondering whether it demonstrates an overfitting situation. Specifically, the audio extracted from the videos is the same as the speech audio files, so the test set is contaminated with examples from the training set.

For example, let's say the training set contains data points X1, X2, X3, X4, X5 and the test set contains X2, X4, X5, X6. The model is more likely to get X2, X4 and X5 right because it has already seen them during training, so they do not reflect the true predictive power of your models.

To verify this, we can either remove the audio extracted from the videos, or make sure that the audio extracted from a video and the corresponding speech audio file end up together, either both in the training set or both in the test set.
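A minimal sketch of that grouped split, assuming the standard RAVDESS filename convention (the first field encodes the modality, e.g. 01 = audio-video, 03 = audio-only, and the remaining fields identify the recording); the `features` directory and the way the file list is built are hypothetical:

```python
import os
from sklearn.model_selection import GroupShuffleSplit

def stimulus_id(filename):
    """Group key: the RAVDESS filename minus the modality prefix, so
    '03-01-06-01-02-01-12.wav' and '01-01-06-01-02-01-12.wav'
    (same recording, different modality) share one group."""
    parts = os.path.splitext(filename)[0].split("-")
    return "-".join(parts[1:])

# Hypothetical: one pre-extracted feature file per clip in a flat directory.
filenames = sorted(os.listdir("features"))
groups = [stimulus_id(f) for f in filenames]
labels = [f.split("-")[2] for f in filenames]  # 3rd field of a RAVDESS name is the emotion code

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(filenames, labels, groups=groups))
# Every clip derived from the same recording now lands entirely in train or in test.
```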

marcogdepinto commented 4 years ago

Hi, this model is built only to solve this problem: predicting emotions on the RAVDESS dataset. Overfitting is correct in this case because it is what I wanted to achieve. If you want a more generalized model, feel free to reduce the number of layers in the neural network and/or remove/add different training data. This is not an issue but expected behaviour, hence I am resolving the issue.

EnisBerk commented 4 years ago

I wish I had seen this last week. OMG. I came here to open the exact same issue.

Dear @marcogdepinto, @alezenonos is right. You are training and testing on the same files, so your model is not learning anything about emotions; it is just memorizing files. You could feed it filenames instead of sound and you would get the same results.

You are extracting audio from the videos for no reason, because the audio-only files already come from the same recordings. Please state this in big letters at the top of your README, because it just cost me hours of wondering why I could not reproduce your results.

marcogdepinto commented 4 years ago

@EnisBerk I am afraid that was not clear. The name of the project is "Emotion Classification RAVDESS" for a reason.

I have added the following sentence on top of the README: "Please note this project is not made for generalization: it is built to work only with the files of the RAVDESS dataset, not for any audio file".

Sorry again for the misunderstanding.

alezenonos commented 4 years ago

Hi @marcogdepinto. This is not about generalising to other datasets; it is about learning features from this dataset. By contaminating the test set with data from the training set, you wouldn't need an ML algorithm at all: the problem almost becomes deterministic. Nevertheless, your code is really useful and thank you for it. The only change I would make to deal with this issue is to remove the video-extracted audio, so that the ML algorithm actually learns the features from the audio. I understand this is not an active project, so this might help others as well. Accuracy won't be as good, but it will be more realistic. This issue is also called data leakage.

marcogdepinto commented 4 years ago

Hi @alezenonos, thanks for the hints, very much appreciated! One thing I noticed reading previous issues is that the features extracted from the audio should contain some noise (https://github.com/marcogdepinto/Emotion-Classification-Ravdess/issues/6). The FFmpeg command that extracts additional audio from the videos (https://github.com/marcogdepinto/Emotion-Classification-Ravdess/blob/master/Mp4ToWav.py) sets a sample rate of 44100 via the -ar flag (more on https://ffmpeg.org/ffmpeg-all.html), which corresponds to 44.1 kHz. The original sample rate of the audio files in the dataset is 48 kHz (source: https://zenodo.org/record/1188976#.XYendZMzZN1), so the features created by librosa MFCC should be different.

Unfortunately I have never had time to test two files and check whether the generated arrays differ from the original ones. If the values are different, this could be considered a data augmentation approach (e.g. what is done when rotating pictures in computer vision problems). Correct me if I am wrong here. On the other hand, if the arrays are the same, you are right: when I have some time I will re-train the model without the audio extracted from the videos and review the changes (if you want to run the test yourself, that would be great!).

I may also work on a test set built from the files of the last 3 actors, excluding those from training. The only issue is that I honestly do not have time to do this now; hopefully in the next months I'll be able to code a different approach and compare results.
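A quick way to run that comparison with librosa, assuming a 44.1 kHz WAV has already been extracted from the matching video with Mp4ToWav.py; the file paths and the 40-coefficient setting below are illustrative assumptions, not necessarily the project's exact pipeline:

```python
import librosa
import numpy as np

# Hypothetical paths: the original 48 kHz RAVDESS audio file and the 44.1 kHz
# WAV extracted from the matching video with Mp4ToWav.py.
ORIGINAL_WAV = "Audio_Speech_Actors_01-24/Actor_01/03-01-06-01-02-01-01.wav"
EXTRACTED_WAV = "extracted/01-01-06-01-02-01-01.wav"

def mean_mfcc(path, n_mfcc=40):
    """Load the file at its native sample rate and average the MFCCs over time."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return sr, np.mean(mfcc.T, axis=0)

sr_orig, feats_orig = mean_mfcc(ORIGINAL_WAV)
sr_extr, feats_extr = mean_mfcc(EXTRACTED_WAV)

print(f"original: {sr_orig} Hz, extracted: {sr_extr} Hz")
print("max absolute difference between mean MFCC vectors:",
      np.max(np.abs(feats_orig - feats_extr)))
# If the difference is non-trivial, the resampled copies act as perturbed
# versions of the originals; if it is ~0, they are effectively duplicates.
```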

alezenonos commented 4 years ago

Hi @marcogdepinto, you are right! Data augmentation would help a lot, as long as the values are different and the augmentation does not fundamentally change how the emotion is conveyed through the audio. So someone should still consider which augmentations to apply. In this particular case, as long as no data from the training set is duplicated in the test set, it should be fine.
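For illustration only, two waveform-level augmentations that are often assumed to leave the expressed emotion intact (low-level noise, a small time stretch); whether they really do for this dataset is exactly the kind of check being discussed, and the file name and parameter values are arbitrary examples:

```python
import librosa
import numpy as np

def add_noise(y, noise_level=0.005):
    """Add low-level white noise to the waveform."""
    return y + noise_level * np.random.randn(len(y))

def slight_time_stretch(y, rate=1.05):
    """Speed the clip up by 5%; large rates could distort emotional prosody."""
    return librosa.effects.time_stretch(y, rate=rate)

# Hypothetical input file; load at the native 48 kHz sample rate.
y, sr = librosa.load("03-01-06-01-02-01-01.wav", sr=None)
augmented = [add_noise(y), slight_time_stretch(y)]
# Augmented copies must stay on the same side of the train/test split
# as the original clip, as discussed above.
```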

EnisBerk commented 4 years ago

Hi @alezenonos @marcogdepinto, thanks for your quick responses on an inactive repo. I agree that the data augmentation approach is a good idea. You just need to make sure that augmented versions of the same file do not end up in both the training and the test set, unless the augmentation changes the emotion in the audio.

marcogdepinto commented 4 years ago

@EnisBerk @alezenonos no worries. Again, I have no bandwidth to work on this now, especially considering it will require a huge refactoring (it was written in 2018 with VERY BAD code style, and it also needs to be migrated from Jupyter notebooks to proper .py files, classes, etc.). I hope I'll be able to pick it up at some point in the next months. Thanks to both of you for the valuable input.

marcogdepinto commented 4 years ago

@EnisBerk @alezenonos hey both, just a heads up: I found the time to do all the stuff above. This project just received a major refactoring. I have:

  1. moved everything from Jupyter notebooks to proper Python files;
  2. refactored the code (still some work to do there) and added docstrings/comments;
  3. removed the audio features extracted from the videos;
  4. added a pipeline to include a new set of features extracted from the TESS dataset;
  5. reviewed the README to explain better how everything works.

Points 3 and 4 reduced accuracy to 80%, but the model should be able to generalize better.

Thank you both for having inspired the revamping of this project!

EnisBerk commented 4 years ago

Thank you for taking the time to improve the repository. I am sure it will be helpful to others.