scutcsq / DWFormer

DWFormer: Dynamic Window Transformer for Speech Emotion Recognition (ICASSP 2023 Oral)

Feature extraction #8

Closed: basavarajsh98 closed this issue 5 months ago

basavarajsh98 commented 6 months ago

The extracted WavLM features have different dimensions for each input file. The numpy feature array for one input file has shape (67, 1024). Does this mean there are 67 feature vectors of 1024 dimensions each? Are these the frame-level features? Can we pool over these feature vectors to get shape (1, 1024)? Would that then be the utterance-level feature?

Thanks,

scutcsq commented 6 months ago

Hello, the length of the feature corresponds to the length of the audio; each token covers roughly 20 ms of speech. So you can regard these as frame-level features, and pooling over the feature vectors gives an utterance-level representation.
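
For illustration, a minimal numpy sketch of that pooling (the filename here is just a placeholder):

```python
import numpy as np

# Frame-level WavLM features for one utterance, e.g. shape (67, 1024).
feats = np.load("utterance_features.npy")

# Mean pooling over the time (frame) axis yields a single
# utterance-level vector of shape (1, 1024).
utt_feat = feats.mean(axis=0, keepdims=True)
```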

basavarajsh98 commented 6 months ago

Thanks. So was my understanding right? A generated .npy file with dimension (67, 1024) means 67 feature vectors with 1024 dimensions each, i.e. 67 tokens, each with a 1024-dimensional feature vector?

I believe DWFormer was evaluated with frame-level features. Did you by any chance test with utterance-level features as well?

Also, the 12th layer of WavLM Large was used for feature extraction. How was this layer chosen? As per the WavLM paper, the top layers are suitable for ASR and the lower layers for speaker-related tasks.

scutcsq commented 6 months ago

Thanks for your question. A generated .npy file with dimension (67, 1024) means 67 feature vectors with 1024 dimensions each, and the order of the feature vectors follows the temporal order of the speech. Our model is designed for frame-level features, since we focus on discovering emotional information at local scales; an utterance-level feature is therefore not suitable for our framework. As mentioned in the paper, variations of pitch and speech rate, sounds such as laughter and sighs, and word choice can all convey emotion. To recognize emotion in speech well, both semantic and paralinguistic information need to be considered. Since the top layers of WavLM are suited to extracting semantic information and the lower layers to extracting paralinguistic information, we use the 12th layer (the middle of the 24-layer model) of WavLM Large for feature extraction.
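
For reference, here is a minimal sketch of such an extraction using the HuggingFace transformers WavLMModel; this is only an illustration, and the actual extraction script in this repo may differ in preprocessing details:

```python
# Sketch: extract 12th-layer WavLM Large features for one utterance.
import numpy as np
import torch
from transformers import WavLMModel

model = WavLMModel.from_pretrained("microsoft/wavlm-large")
model.eval()

# Load a 16 kHz mono waveform; a ~1.34 s clip gives about 67 frames
# at ~20 ms per frame. A dummy signal stands in for real audio here.
wav = torch.randn(1, int(16000 * 1.34))
# WavLM Large expects roughly zero-mean, unit-variance input.
wav = (wav - wav.mean()) / (wav.std() + 1e-7)

with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states[0] is the CNN/embedding output, so index 12 is the
# output of the 12th transformer layer (of 24 in WavLM Large).
feats = out.hidden_states[12].squeeze(0)  # (num_frames, 1024)
np.save("utterance_features.npy", feats.cpu().numpy())
```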

basavarajsh98 commented 5 months ago

Thank you so much for your detailed explanation! I really appreciate it!