[EAIS 2020] Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients

uhhyunjoo / paper-notes

Quick, informal notes left as issues.


[EAIS 2020] Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients #13

Open uhhyunjoo opened 2 years ago

uhhyunjoo commented 2 years ago

link
paper	Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients
github	emotion-classification-from-audio-files

uhhyunjoo commented 2 years ago

Abstract

Understanding people through spoken language is a hard task for machines, because a speech sound wave depends on many variables.
One sub-task of speech understanding is detecting the emotion a speaker conveys while talking, and this is the main focus of the paper's contribution.
Specifically, the paper proposes a model based on deep neural networks (CNNs) that classifies the emotions elicited by speech.
To do this, the authors focus on the audio recordings of the RAVDESS dataset.
The model is trained to classify eight emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise).
These eight emotions are the ones proposed by Ekman, plus neutral and calm.
The evaluation metric is the F1 score: the weighted average on the test set is 0.91, and the best performance is on the "Angry" class with a score of 0.95.
The worst result is a score of 0.87 on the "Sad" class, and even that is better than the SOTA...!

uhhyunjoo commented 2 years ago

Introduction

Understanding emotions from conversation tends to be overlooked.
With this in mind, the paper focuses on an efficient strategy for identifying the main emotion a subject expresses during a conversation.
People sometimes show more than one basic emotion at once, but the authors argue that recognizing the percentage of such mixed emotions is extremely difficult for both speaker and listener.
Given this, they build a model that aims to identify the emotion with the largest value in an audio track.
Other approaches have tried to make machines classify feelings through computer vision or text analysis.
This work aims to use pure audio data, relying on Mel-frequency cepstral coefficients (MFCC).
[ ] dialogue : conversation

uhhyunjoo commented 2 years ago

Related Work

Many classification strategies have been proposed before.

A real-time emotion recognition from speech using gradient boosting
- Uses Gradient Boosting, KNN, and SVM on the RAVDESS dataset to identify differences by gender and to do granular classification, reaching 40% to 80% accuracy on specific tasks.
- The proposed classifiers behaved differently on other datasets. (For this paper, though, I'll only look at RAVDESS.)
- Three types of dataset were created: only male recordings, only female recordings, and a combined one.
- Performance seems to vary somewhat by gender / emotion / model.
Ubiquitous Emotion Recognition Using Audio and Video Data
- 66.41% accuracy on audio
- 90% accuracy on audio + video
- Given pre-processed image data containing faces and audio waveforms, three separate deep networks were trained.
- 1 : only on image data, 2 : only on plotted audio waveforms, 3 : both image and waveform data
Recognizing emotion from singing and speaking using shared models
- The first approach to use the RAVDESS dataset
- However, it classified only some of the available emotions.
- Its overall accuracy is higher than that of the model proposed in this paper, but it used fewer classes.
- It proposed three shared emotion recognition models for speech and song.
- a simple model : builds a single classifier that is independent of the domain.
- a single-task hierarchical model : uses the domain during training. A separate emotion classifier is trained for each domain.
- a multi-task hierarchical model : uses the domain during training. A multi-task classifier is trained that can jointly predict emotion for both domains.
- In the testing phase, the testing data is separated according to the predicted domain.
- The data is then analyzed with the classifier that corresponds to the estimated domain.
- That study was carried out by adopting the directed acyclic graph SVM (DAGSVM).

uhhyunjoo commented 2 years ago

Proposed Model

model of classification of emotions
based on a deep learning strategy built on convolutional neural networks (CNN) and dense layers.
key idea : use Mel-frequency cepstral coefficients (MFCC), known as the "spectrum of a spectrum", as the only feature for training the model.
MFCC is a different interpretation of the Mel-frequency cepstrum (MFC).
MFC coefficients are used because they can represent the amplitude spectrum of a sound wave in a compact vectorial form.
The audio file is divided into frames, the Discrete Fourier Transform is applied, and only the logarithm of the amplitude spectrum is kept.
The amplitude spectrum is then normalized by a reduction to the "Mel" frequency scale.
This is done to emphasize the frequencies that matter most for a meaningful reconstruction of the wave as the human auditory system perceives it.
40 features are extracted from each audio file.
To generate them, each audio file is first converted into a floating-point time series.
Then an MFCC sequence is generated from the time series.
The MFCC array is transposed, and the arithmetic mean is computed along the horizontal axis.
The MFCC calculations are described in detail in [1] and [5].
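
The feature extraction steps above (time series, MFCC sequence, transpose, mean) can be sketched roughly as follows. This is a minimal sketch, not the authors' code: the function names `mean_mfcc` and `extract_features` are made up for illustration, and loading at the file's native sample rate (`sr=None`) is an assumption, since the notes don't say which rate was used.

```python
import numpy as np


def mean_mfcc(mfcc: np.ndarray) -> np.ndarray:
    """Collapse an MFCC matrix of shape (n_mfcc, n_frames) into one vector.

    The matrix is transposed to (n_frames, n_mfcc) and averaged over the
    frame axis, which leaves one value per coefficient.
    """
    return np.mean(mfcc.T, axis=0)


def extract_features(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load an audio file and return its mean MFCC vector of length n_mfcc."""
    import librosa  # imported lazily so mean_mfcc can be used without librosa

    # librosa.load returns a floating-point time series and its sample rate.
    # sr=None (keep the native rate) is an assumption, not taken from the paper.
    signal, sample_rate = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mean_mfcc(mfcc)
```

Calling `extract_features` on a `.wav` file would give one 40-value vector per file, which matches the "40 features per audio file" described above. Averaging over frames throws away the temporal order, so the model only sees a per-coefficient summary of each clip.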

uhhyunjoo commented 2 years ago

The deep neural network designed for the classification task is shown in Fig 1.
- The network operates on vectors of 40 features, one for each audio file given as input.
- The 40 values are the compact numerical form of a 2-second audio frame.
- Consequently, it takes an input of size (number of files) x 40 x 1, since it runs one round of a 1D CNN with a ReLU activation function, a dropout of 20%, and a 2x2 max-pooling function.
- ReLU can be formalized as g(z) = max{0, z}. Applying it yields a large value whenever the unit is active, which makes it a good choice for representing hidden units.
- Here, pooling helps the model focus only on the main characteristics of each portion of the data, which makes it invariant to position.
- This process was run once more with a different kernel size.
- After that, another dropout is applied and the output is flattened to make it compatible with the next layers.
- Finally, one Dense layer (fully connected layer) with a softmax activation is applied, taking the output size from 640 elements to 8 and estimating the probability distribution over the properly encoded classes.
- (0 = Neutral; 1 = Calm; 2 = Happy; 3 = Sad; 4 = Angry; 5 = Fearful; 6 = Disgust; 7 = Surprised).
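
To make the building blocks in the list above concrete, here is a small NumPy sketch of ReLU, 1D max pooling, and the final 640-to-8 softmax layer. It is only an illustration under stated assumptions: the weights are random (untrained), the pooling window of 2 is picked for the demo, and the helper names are hypothetical. It does not reproduce the network in Fig 1.

```python
import numpy as np

# Class encoding as listed in the notes (index = encoded label).
EMOTIONS = ["Neutral", "Calm", "Happy", "Sad",
            "Angry", "Fearful", "Disgust", "Surprised"]


def relu(z):
    """g(z) = max{0, z}, applied element-wise."""
    return np.maximum(0.0, z)


def max_pool_1d(x, size):
    """Non-overlapping max pooling over a 1D signal (leftover tail is dropped)."""
    usable = (len(x) // size) * size
    return x[:usable].reshape(-1, size).max(axis=1)


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
flattened = relu(rng.normal(size=640))            # stand-in for the flattened conv output
weights = rng.normal(scale=0.05, size=(640, 8))   # untrained, illustrative weights
bias = np.zeros(8)
probabilities = softmax(flattened @ weights + bias)
predicted = EMOTIONS[int(np.argmax(probabilities))]
```

A quick way to see the position invariance: `max_pool_1d(np.array([0, 7, 0, 0]), 2)` and `max_pool_1d(np.array([7, 0, 0, 0]), 2)` give the same result, because the peak only moved inside one pooling window.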

uhhyunjoo commented 2 years ago

Evaluation of the model

The evaluation is designed to check whether the model yields accuracy good enough to support interesting considerations about the subject, including speech in real noisy domains, for use in future work.
Besides the proposed model, the authors evaluated several other classification models to build a baseline for the obtained results.
As a first approach, a decision tree (DT) and a random forest (RF) classifier with 1000 trees were run.
Both models were implemented with the default parameters of the Python library sklearn.
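
The two baselines could be set up along these lines. This is a sketch under assumptions: the data here is synthetic (random 40-feature vectors with 8 labels), `fit_baselines` is a hypothetical helper, and `random_state` is added only for reproducibility, which is not mentioned in the notes. The demo call uses 50 trees to stay fast, while the default argument matches the 1000 trees from the notes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def fit_baselines(features, labels, n_trees=1000, seed=0):
    """Fit a decision tree and a random forest, leaving other parameters at sklearn defaults."""
    tree = DecisionTreeClassifier(random_state=seed).fit(features, labels)
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(features, labels)
    return tree, forest


# Synthetic stand-in data: 80 vectors of 40 features, 8 classes.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(80, 40))
demo_labels = np.arange(80) % 8
demo_tree, demo_forest = fit_baselines(demo_features, demo_labels, n_trees=50)
```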

uhhyunjoo commented 2 years ago

Dataset

Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

It contains 7356 files recorded by 24 professional actors (12 female, 12 male).
vocalizing two lexically-matched statements in a neutral North American accent
Speech expressions : calm, happy, sad, angry, fearful, surprise, disgust
Song expressions : calm, happy, sad, angry, fearful
Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. (Does that mean three in total?)
three modality formats
- Audio-only (16bit, 48kHz .wav)
- Audio-Video (720p H.264, AAC 48kHz, .mp4)
- Video-only (no sound)
Ratings were provided by 247 untrained adult research participants from North America.
A further set of 72 participants provided test-retest data.
High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported.

Each of the 7356 RAVDESS files has a unique filename made up of seven identifiers.

Modality (01 = full-AV, 02 = video-only, 03 = audio only);
Vocal channel (01 = speech, 02 = song);
Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised);
Emotional intensity (01 = normal, 02 = strong).
- NOTE: There is no strong intensity for the ’neutral’ emotion;
Statement (01 = ”Kids are talking by the door”, 02 = ”Dogs are sitting by the door”);
Repetition (01 = 1st repetition, 02 = 2nd repetition);
Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
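
A filename built from these seven identifiers can be decoded with a small parser. This is a sketch that assumes the identifiers are two-digit codes joined by hyphens in the order listed above (e.g. `03-01-06-01-02-01-12.wav`). The separator is not stated in these notes, and `parse_ravdess_filename` is a made-up name.

```python
import os
from dataclasses import dataclass

MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door",
             "02": "Dogs are sitting by the door"}


@dataclass(frozen=True)
class RavdessFile:
    modality: str
    vocal_channel: str
    emotion: str
    intensity: str
    statement: str
    repetition: int
    actor: int
    actor_gender: str


def parse_ravdess_filename(filename: str) -> RavdessFile:
    """Decode the seven identifiers (assumed hyphen-separated) of a RAVDESS filename."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 identifiers, got {len(parts)}: {filename!r}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    actor_id = int(actor)
    return RavdessFile(
        modality=MODALITY[modality],
        vocal_channel=VOCAL_CHANNEL[channel],
        emotion=EMOTION[emotion],
        intensity=INTENSITY[intensity],
        statement=STATEMENT[statement],
        repetition=int(repetition),
        actor=actor_id,
        # Odd-numbered actors are male, even-numbered actors are female.
        actor_gender="male" if actor_id % 2 == 1 else "female",
    )
```

For the example filename, the parser would report an audio-only speech file, emotion "fearful", normal intensity, the "Dogs" statement, first repetition, actor 12 (female). An unknown code raises a `KeyError`, which is handy for spotting malformed filenames.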

uhhyunjoo commented 2 years ago

Enrichment of training data

Deep learning models struggle when data is scarce, so the authors build a pipeline to enrich the training and test sets.

video : a new set of features at a different frequency, extracted with the FFMPEG library
audio from video files : audio at a frequency of 44.1 kHz
audio files : a frequency of 48 kHz
This does introduce noise, but that was needed, and either way it increases the dimension of the training and test sets.
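
One way to pull a 44.1 kHz audio track out of a video file is to call the `ffmpeg` command-line tool. The sketch below is an assumption about how such a step could look, not the authors' actual pipeline. The helper names and file paths are hypothetical, and only standard `ffmpeg` options are used (`-i` input, `-vn` drop video, `-ar` audio sample rate, `-y` overwrite output).

```python
import subprocess


def build_extract_audio_command(video_path, wav_path, sample_rate_hz=44100):
    """Build the ffmpeg argument list that extracts audio at the given sample rate."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn",                        # ignore the video stream
            "-ar", str(sample_rate_hz),   # resample audio to this rate
            wav_path]


def extract_audio(video_path, wav_path, sample_rate_hz=44100):
    """Run ffmpeg (must be installed and on PATH) and raise if it fails."""
    command = build_extract_audio_command(video_path, wav_path, sample_rate_hz)
    subprocess.run(command, check=True)
```

The extracted tracks carry the same utterances as the audio-only files but at a different sample rate and through a different codec path, which is presumably where the extra noise comes from.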

Metrics, data splitting and experimental runs

files are randomly split into train/test. (test is 33%)
training set : 3315 MFCC vectors of 40 features (using the LibROSA library)
test set shape : 1633 x 40
No cross-validation set
label : class values encoded as integers (described above)
evaluation metric : F1 score -> a compact indicator of the quality of the classifier, standard for comparing our results with sota
trained : sparse categorical cross entropy loss function, rmsprop optimizer for 1000 epochs
best model is used for classification phase
the number of batches : 16 (for optimization reasons)
validate : during training, using accuracy score as common in deep learning architectures
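
The split sizes and the weighted F1 metric can be checked with a few lines of NumPy. This is a sketch with assumptions: rounding the test share up is my guess (it happens to reproduce the reported 3315 / 1633 split from 4948 vectors), and `weighted_f1` is a hand-written helper that weights each class's F1 by its support, not a library call.

```python
import math

import numpy as np


def split_sizes(n_samples, test_fraction=0.33):
    """Return (n_train, n_test), rounding the test share up (an assumption)."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


def weighted_f1(y_true, y_pred, n_classes):
    """Average of per-class F1 scores, weighted by the number of true samples per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += (tp + fn) * f1                    # weight by class support
    return float(total / len(y_true))
```

For example, `split_sizes(3315 + 1633)` gives back the two set sizes listed above, and `weighted_f1([0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 0, 2], 3)` combines per-class F1 values of 0.5, 2/3 and 1.0 with supports 2, 3 and 1.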

uhhyunjoo commented 2 years ago

Discussion of results

precision, recall, F1
The results show that precision and recall are very balanced, giving F1 values around 0.90 for almost all classes.
The small spread in F1 results shows the robustness of the model, which effectively manages to classify all eight emotion classes correctly.
Sad, Surprised : the model is less accurate here -> not really surprising, since the literature already knows these as hard classes to recognize (speech, facial expression, analysing written text, etc.).

in order to evaluate the effectiveness of the classification of emotions -> compare it with the results obtained from the two baselines, decision tree (DT) and random forest (RF), and the works of [6] and [16].
It is known that as the number of classes grows, the task gets harder and accuracy drops.
Even so, the proposed CNN-MFCC model has an F1 score that is on average comparable with the two works it is compared against.

A further index of model reliability can be found in Fig. 2 and Fig. 3.
Fig. 2 shows that the value of loss (error in the accuracy of the model) decreases on both the test set and the training set up to the 1000th epoch.
The decrease is less evident from the 400th epoch on, but it is still perceptible.

Fig. 3 shows that the average value of accuracy over all the classes, unlike the loss, increases as the epochs increase.
These loss and accuracy values do not differ much between the training and test datasets, which shows the model did not overfit.
In fact, these results are consistent with the F1 scores observed earlier.

The authors consider the results encouraging.

If a dataset larger than RAVDESS were available, MFCC would be a valid emotion detection feature.
With the same model structure, the authors expect similar performance on audio files that are less structured and collected directly in a real noisy environment.
The MFCC transformation is always applicable, and it can work well given noise reduction and enough training data.
Accordingly, as future work they are experimenting with pieces of dialog collected directly from real users.

uhhyunjoo commented 2 years ago

	Speech	Song
Audio + Video	2880	2024
Audio	1440	1012

Audio + Video, speech + song = 4904
Audio, speech + song = 2452
Total: 7356
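
The counts in this table can be rebuilt from the filename identifiers. The sketch below rests on two assumptions that are not in these notes: that only 23 of the 24 actors have song recordings (my recollection of the RAVDESS documentation), and that the "Audio + Video" row counts the full-AV and video-only files together. The helper name is made up.

```python
def trials_per_actor(n_emotions_with_two_intensities, statements=2, repetitions=2):
    """Neutral has one intensity; every other emotion has two (normal, strong)."""
    emotion_intensity_pairs = 1 + 2 * n_emotions_with_two_intensities
    return emotion_intensity_pairs * statements * repetitions


speech_per_actor = trials_per_actor(7)   # calm, happy, sad, angry, fearful, surprise, disgust
song_per_actor = trials_per_actor(5)     # calm, happy, sad, angry, fearful

speech_audio = speech_per_actor * 24     # all 24 actors
song_audio = song_per_actor * 23         # ASSUMPTION: 23 actors have song files

speech_video = 2 * speech_audio          # full-AV + video-only
song_video = 2 * song_audio

total_files = speech_audio + song_audio + speech_video + song_video
```

Under those assumptions the numbers line up with the table: 60 speech trials and 44 song trials per actor, and three modality formats for every trial.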