Closed basavarajsh98 closed 5 months ago
Hello, the length of the feature sequence corresponds to the length of the audio; each token covers roughly 20 ms of speech. So you can regard these as frame-level features, and pooling over the feature vectors gives an utterance-level feature.
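As a rough illustration of the 20 ms figure, here is a minimal sketch that estimates how many frames a clip produces. It assumes 16 kHz audio and the wav2vec 2.0-style convolutional front end (kernel sizes and strides below); neither is stated in this thread, so treat both as assumptions. The names `num_frames`, `CONV_KERNELS`, and `CONV_STRIDES` are made up for this example.

```python
import math

# Assumption: 16 kHz input and a wav2vec 2.0-style conv feature encoder.
SAMPLE_RATE = 16_000
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Frames produced by the stacked, unpadded conv layers (input >= 400 samples)."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# Total hop between consecutive frames: product of the strides.
hop_samples = math.prod(CONV_STRIDES)          # 320 samples
hop_ms = 1000 * hop_samples / SAMPLE_RATE      # 20.0 ms

print(hop_ms)              # 20.0
print(num_frames(16_000))  # 49 frames for 1 s of audio
print(num_frames(21_520))  # 67 frames for about 1.35 s of audio
```

Under these assumptions, a (67, 1024) feature file corresponds to a clip of roughly 1.35 seconds.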
Thanks. So was my understanding right? A generated .npy file with shape (67, 1024) means 67 feature vectors with 1024 dimensions each? As in, 67 tokens, each with a 1024-dimensional feature vector?
I believe DWFormer was evaluated with frame-level features. Did you by any chance test with utterance-level features as well?
Also, the 12th layer of WavLM Large was used for feature extraction. How was this layer chosen? According to the WavLM paper, the top layers are suited to ASR and the lower layers to speaker-related tasks.
Thanks for your question. A generated .npy file with shape (67, 1024) contains 67 feature vectors with 1024 dimensions each, and the order of the feature vectors follows the speech. Our model is designed for frame-level features, since we focus on discovering emotional information at local scales. Therefore, utterance-level features are not suitable for our framework. As mentioned in the paper, variations in pitch and speech rate, sounds such as laughter and sighs, and the words themselves can all convey emotion. To recognize emotion in speech well, both semantic and paralinguistic information therefore need to be considered. Since the top layers of WavLM are suited to extracting semantic information and the lower layers to extracting paralinguistic information, we use the 12th layer of WavLM Large, a middle layer, for feature extraction.
Thank you so much for your detailed explanation! I really appreciate it!
The extracted WavLM features have a different shape for each input file. For one input file, the NumPy feature array has shape (67, 1024). Does this mean there are 67 feature vectors, each with 1024 dimensions? Are these the frame-level features? Can we pool over these feature vectors to get a (1, 1024) shape? Would that then be the utterance-level feature?
Thanks,
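For reference, the pooling asked about here can be sketched in a few lines of NumPy. This is only an illustration: the random array stands in for a loaded feature file, and mean pooling is one common choice rather than something prescribed in this thread.

```python
import numpy as np

# Stand-in for a loaded feature file; in practice this would come from np.load(...).
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((67, 1024)).astype(np.float32)

# Frame-level: one 1024-dim vector per frame, in time order.
print(frame_features.shape)   # (67, 1024)

# Utterance-level: average over the time axis, keeping a leading axis of size 1.
utterance_mean = frame_features.mean(axis=0, keepdims=True)
print(utterance_mean.shape)   # (1, 1024)

# Max pooling over time is another option with the same output shape.
utterance_max = frame_features.max(axis=0, keepdims=True)
print(utterance_max.shape)    # (1, 1024)
```

Because the pooled vector has the same shape regardless of clip length, it also removes the per-file difference in the first dimension, at the cost of discarding the temporal order.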