mirix / approaches-to-diarisation

A testing repo to share code and thoughts on diarisation
MIT License

Speech Emotion Recognition (SER) #6

Open mirix opened 1 year ago

mirix commented 1 year ago

According to its own creators, the audEERING model was not the wisest of choices.

To address those shortcomings, I have forked CMU-MOSEI:

https://github.com/mirix/messaih

The fork is tailored towards Speech Emotion Recognition (SER).

The idea now would be to train a model on messAIh and see how it behaves in a real-life scenario.

YuryKonvpalto commented 1 year ago

Hi Mirix,

Please tell me: how did you convert the model's VAD values ([0...1]) to Euclidean space ([-1...1])? The model's values always seem to be around 0.3... for each dimension.

mirix commented 1 year ago

I tried different normalisation strategies from sklearn, but, in the end, I settled on this:

# rescale from the [0, 1] range to the [-1, 1] range
vad_sample = (vad_sample - .5) * 2

If you think about it, it is like converting, say, Fahrenheit to Celsius: you have to shift the zero and rescale the size of the degree:

https://www.calculatorsoup.com/calculators/conversions/fahrenheit-to-celsius.php
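
In code, the same shift-and-rescale idea looks like this (the rescale helper is just for illustration, it is not in the repo):

import numpy as np

def rescale(x, old_min, old_max, new_min, new_max):
    # shift the zero, then rescale the size of the unit
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

vad_sample = np.array([0.75, 0.25, 0.50])
print(rescale(vad_sample, 0.0, 1.0, -1.0, 1.0))  # [ 0.5 -0.5  0. ]
print(rescale(212.0, 32.0, 212.0, 0.0, 100.0))   # 100.0 (boiling point)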

YuryKonvpalto commented 1 year ago

OK. But their model always predicts values for each of V-A-D starting with 0.3... (I have tested a lot of audio files, but never got a value outside that range). So the digits that differ only come after the 0.3...

Taking your way of converting (vad_sample = (vad_sample - .5) * 2), it would always give a negative value. E.g., V = 0.334, A = 0.330, D = 0.336: if we subtract 0.5 from each value, it goes negative.

In your code you stipulate 'Centroids of the Ekman emotions in a VAD diagram'; for JOY it is {'v': 0.76, 'a': 0.48, 'd': 0.35}. But for the reasons above, we can't even get a small positive value if we convert like that (vad_sample = (vad_sample - .5) * 2).
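
In numbers, just to make the point explicit:

v, a, d = 0.334, 0.330, 0.336
print((v - .5) * 2, (a - .5) * 2, (d - .5) * 2)
# approx. -0.332, -0.340, -0.328 -> all negative, nowhere near JOY (0.76, 0.48, 0.35)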

Am I missing something?

mirix commented 1 year ago

It works relatively well for me. If you obtain constant values and I had to guess, I would say there is an issue with your embeddings, which may actually come from your audio.

I am having the same problem when trying to fine-tune a wav2vec2 model: I obtain constant values, and I am guessing that the issue comes from reading the numpy arrays with the feature extractor, but I am still investigating.
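
For reference, this is roughly how the numpy arrays are supposed to go through the feature extractor (the checkpoint name here is just an example, not necessarily the one I use):

import numpy as np
from transformers import Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base')

waveform = np.random.randn(16000).astype(np.float32)  # 1 s of fake audio at 16 kHz

# the extractor does not resample: the array must already be 16 kHz mono float,
# and feeding anything else is one way to end up with near-constant predictions
inputs = extractor(waveform, sampling_rate=16000, return_tensors='pt')
print(inputs['input_values'].shape)  # torch.Size([1, 16000])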

YuryKonvpalto commented 1 year ago

It's strange... Maybe you downloaded a model version from Hugging Face that works a little differently from the version deployed on HF now? Using the Hugging Face version of the model (via HF itself), it always gives values starting with 0.3. The values are not constant, though: the digits after the 0.3 (i.e., after the first decimal) differ and vary with the input audio.

I guess the zero-point for each of V-A-D is 0.3333333... So when a value goes below 0.3333, I assume it becomes negative; when it goes above 0.3333, positive.
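
If that hypothesis is right, the conversion would need to shift the zero-point by 1/3 rather than 0.5 (pure speculation on my side):

vad_sample = 0.334
print((vad_sample - 1/3) * 3)  # approx. 0.002, slightly positive
# note: this maps [0, 1] to [-1, 2], so the scaling factor would need more thought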

May I kindly ask you to check the version of the model deployed on HF when you get a chance: do you get values for each of V-A-D higher or lower than 0.3...?

mirix commented 1 year ago

The model is over 1 year old, so my guess is that the issue is feature extraction.

mirix commented 1 year ago

Anyway, I am not using that model anymore. I am trying to train my own. I just posted the script to hear people's opinions on the VAD-to-Ekman conversion.

It actually works relatively well (compared to other models, of course). Sentence by sentence there are many errors but, if you consider the conversation as a whole, clusters of certain emotions are typically good indicators for flagging the conversation.
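
Conceptually, the conversation-level flagging amounts to something like this toy sketch (labels and threshold invented for illustration):

from collections import Counter

sentence_emotions = ['neutral', 'anger', 'anger', 'neutral', 'anger', 'sadness']
counts = Counter(sentence_emotions)

# per-sentence errors wash out; a dominant non-neutral cluster flags the conversation
flagged = any(n / len(sentence_emotions) > 0.3
              for emotion, n in counts.items() if emotion != 'neutral')
print(counts, flagged)  # anger appears in half the sentences -> True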

The main issue is that we are particularly interested in detecting fear, and that seems to be precisely one of the model's weak points. The problem is the training dataset.

YuryKonvpalto commented 1 year ago

Very interesting. I found even better research on converting VAD to Ekman emotions in this paper: https://www.researchgate.net/publication/284724383_Affect_Representation_and_Recognition_in_3D_Continuous_Valence-Arousal-Dominance_Space

It takes 15 emotions and tabulates their mean values and standard deviations for each of V-A-D, and it provides the Euclidean distances between all 15 basic emotions. Very interesting. As far as I have read, fear perception remains the most challenging task. I think it comes down to psychology: one rarely masks joy or sadness, but almost everyone tries to mask their fear.
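
With centroids like those, classification is just nearest Euclidean distance, e.g. (JOY taken from your code; the FEAR centroid is made up for illustration, the paper's tables have the real means):

import math

centroids = {
    'joy':  (0.76, 0.48, 0.35),
    'fear': (-0.64, 0.60, -0.43),  # invented values, for illustration only
}

def nearest_emotion(vad):
    # pick the emotion whose centroid is closest in VAD space
    return min(centroids, key=lambda e: math.dist(vad, centroids[e]))

print(nearest_emotion((-0.5, 0.55, -0.3)))  # fear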

What I am trying to achieve is a web app that records a conversation and sends chunks of it (3-5 s) on the fly to the model for emotion evaluation. According to the evaluated atmosphere, it suggests background music: pieces whose VAD corresponds to the conversation's VAD. In theory, if we consider VAD as a vector, you can add the VAD vector of a music piece to it, which either offsets the conversation's VAD vector (e.g., the positive joy vector of a music piece added to the negative sad vector of a conversation makes it at least neutral) or even enhances it (makes a positive conversation VAD vector even more positive).
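
In toy numpy terms (made-up values), the vector idea is simply:

import numpy as np

conversation_vad = np.array([-0.4, 0.2, -0.1])  # sad-ish conversation
music_vad = np.array([0.5, -0.1, 0.1])          # calm, positive piece

# adding the music's VAD vector roughly neutralises the conversation's
print(conversation_vad + music_vad)  # approx. [0.1, 0.1, 0.0]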

mirix commented 1 year ago

It sounds amazing.