Open mirix opened 1 year ago
OK, perhaps I am getting somewhere:
import h5py
import numpy as np
import pandas as pd

filename = '/home/emoman/Downloads/mosei/CMU_MOSEI_Labels.csd'

# Open read-only; the context manager closes the file afterwards
with h5py.File(filename, 'r') as hf:
    # Copy both datasets for one video into memory before the file closes
    feat = np.array(hf['All Labels/data/zv0Jl4TIQDc/features'])
    intval = np.array(hf['All Labels/data/zv0Jl4TIQDc/intervals'])

# One row per annotated segment, seven label columns
df_feat = pd.DataFrame(feat)
print(df_feat)

# One row per annotated segment: start and end time
df_intval = pd.DataFrame(intval)
print(df_intval)
This gives:
          0         1    2         3    4    5    6
0  0.333333  0.666667  0.0  0.666667  0.0  0.0  0.0
1  1.000000  2.000000  0.0  0.000000  0.0  0.0  0.0
2  2.333333  2.666667  0.0  0.000000  0.0  0.0  0.0
        0       1
0  56.852  60.845
1  29.764  35.633
2  42.146  49.242
My interpretation is that video zv0Jl4TIQDc has three intervals annotated with the relative weights of Ekman's basic emotions.
Is that correct?
If that is the case, what would be the mapping of the emotions?
What is the highest possible value for a given emotion?
"Each sentence is annotated for sentiment on a [-3,3] Likert scale of: [−3: highly negative, −2 negative, −1 weakly negative, 0 neutral, +1 weakly positive, +2 positive, +3 highly positive]. Ekman emotions (Ekman et al., 1980) of {happiness, sadness, anger, fear, disgust, surprise} are annotated on a [0,3] Likert scale for presence of emotion x: [0: no evidence of x, 1: weakly x, 2: x, 3: highly x]."
So column 0 is the sentiment score, and the remaining columns would be, in this order, {happiness, sadness, anger, fear, disgust, surprise}?
The issue with this interpretation is that segment 0 above would have been labelled with happiness and anger in similar amounts...
Or is it (Anger Disgust Fear Happy Sad Surprise) as in Table 3?
Then it would be Anger and Fear, which is more consistent, but the sentiment would be slightly positive...
Checking the entries with the most negative and most positive sentiment, the order seems to be {happiness, sadness, anger, fear, disgust, surprise}.
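To make that reading concrete, here is a minimal sketch that attaches names to the seven columns and joins them with the intervals, using the values printed above for zv0Jl4TIQDc. The column order is only the one inferred in this thread, not something confirmed by the dataset documentation, and `LABEL_COLUMNS` and `label_segments` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: column order as inferred in this thread (sentiment first,
# then the six emotions); not confirmed by the dataset documentation.
LABEL_COLUMNS = ["sentiment", "happiness", "sadness", "anger",
                 "fear", "disgust", "surprise"]


def label_segments(intervals, features):
    """Combine (start, end) intervals with named label columns."""
    intervals = np.asarray(intervals, dtype=float)
    features = np.asarray(features, dtype=float)
    if intervals.shape[0] != features.shape[0]:
        raise ValueError("intervals and features must have the same number of rows")
    if features.shape[1] != len(LABEL_COLUMNS):
        raise ValueError("expected %d label columns" % len(LABEL_COLUMNS))
    df = pd.DataFrame(features, columns=LABEL_COLUMNS)
    df.insert(0, "start", intervals[:, 0])
    df.insert(1, "end", intervals[:, 1])
    # The stored rows are not in chronological order, so sort by start time
    return df.sort_values("start").reset_index(drop=True)


# Values printed earlier in this thread for video zv0Jl4TIQDc
intervals = np.array([[56.852, 60.845],
                      [29.764, 35.633],
                      [42.146, 49.242]])
features = np.array([[0.333333, 0.666667, 0.0, 0.666667, 0.0, 0.0, 0.0],
                     [1.000000, 2.000000, 0.0, 0.000000, 0.0, 0.0, 0.0],
                     [2.333333, 2.666667, 0.0, 0.000000, 0.0, 0.0, 0.0]])

segments = label_segments(intervals, features)
print(segments)
```

Under this order, the segment with the highest sentiment (2.33) also has the highest happiness score (2.67), which is at least consistent with the inference above.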
I have forked MOSEI to build a unimodal SER dataset:
Hello,
I am interested in training an audio-only model (or, perhaps, a bimodal audio-text one) using CMU-MOSEI data.
I would recompute the audio embeddings myself.
So I only need the links to the videos, plus the timestamps and the annotated emotions per timestamp range.
How would I go about extracting this information?
Thanks,
Ed
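For reference, one possible way to get such a table is sketched below. It assumes that every entry under 'All Labels/data' has the same 'intervals' and 'features' layout as zv0Jl4TIQDc, that the seven columns follow the order inferred elsewhere in this thread, and that the entry names are YouTube video IDs (so the URL prefix is a guess). `collect_segments`, `LABEL_COLUMNS` and `YOUTUBE_PREFIX` are hypothetical names. The function accepts any mapping, so it is shown here on an in-memory stand-in rather than the real file.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: column order inferred in this thread, not confirmed elsewhere
LABEL_COLUMNS = ["sentiment", "happiness", "sadness", "anger",
                 "fear", "disgust", "surprise"]
# ASSUMPTION: entry names are YouTube video IDs
YOUTUBE_PREFIX = "https://www.youtube.com/watch?v="


def collect_segments(data_group):
    """Flatten {video_id: {'intervals': ..., 'features': ...}} into one table.

    data_group can be a plain dict or an opened h5py group such as
    hf['All Labels/data'], since both support .items() and ['name'] lookup.
    """
    rows = []
    for video_id, entry in data_group.items():
        intervals = np.asarray(entry["intervals"], dtype=float)
        features = np.asarray(entry["features"], dtype=float)
        for (start, end), labels in zip(intervals, features):
            row = {"video_id": video_id,
                   "url": YOUTUBE_PREFIX + video_id,
                   "start": start,
                   "end": end}
            row.update(zip(LABEL_COLUMNS, labels))
            rows.append(row)
    columns = ["video_id", "url", "start", "end"] + LABEL_COLUMNS
    table = pd.DataFrame(rows, columns=columns)
    return table.sort_values(["video_id", "start"]).reset_index(drop=True)


# In-memory stand-in using the values printed in this thread
demo = {
    "zv0Jl4TIQDc": {
        "intervals": [[56.852, 60.845], [29.764, 35.633], [42.146, 49.242]],
        "features": [[0.333333, 0.666667, 0.0, 0.666667, 0.0, 0.0, 0.0],
                     [1.000000, 2.000000, 0.0, 0.000000, 0.0, 0.0, 0.0],
                     [2.333333, 2.666667, 0.0, 0.000000, 0.0, 0.0, 0.0]],
    }
}

table = collect_segments(demo)
print(table[["video_id", "start", "end", "sentiment"]])
```

With the real file, the same function could be called as `collect_segments(hf['All Labels/data'])` inside a `with h5py.File(filename, 'r') as hf:` block, and the result written out with `table.to_csv(...)`.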