```bash
pip install peft-ser
```

```python
# whisper-style loading
import torch
import peft_ser

model = peft_ser.load_model("whisper-base-lora-16-conv")

# 1 second of 16 kHz dummy audio
data = torch.zeros([1, 16000])
output = model(data)
# softmax turns the 4-class output into class probabilities
probs = torch.softmax(output, dim=1)
```
The output emotion mapping is: {0: "Neutral", 1: "Angry", 2: "Sad", 3: "Happy"}. We will release a 6-emotion version later.
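For reference, a minimal sketch of turning the model output from the snippet above into a predicted emotion label with this 4-class mapping:

```python
# Map the most probable class index to its emotion label
label_map = {0: "Neutral", 1: "Angry", 2: "Sad", 3: "Happy"}
pred_idx = torch.argmax(output, dim=1).item()
print(label_map[pred_idx])
```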
For all the released models, we train/evaluate with the same data. Unlike the ACII paper, where the audio was restricted to 6s, these open-release models support audio durations of up to 10s for broader use cases. We also combine the convolutional output with the transformer encodings for fine-tuning, as we find this further increases model performance. We used a fixed seed of 8, 30 training epochs, and a learning rate of 2.5e-4.
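As a rough illustration of the feature combination mentioned above (a sketch under assumed tensor shapes and pooling choices, not the actual model code):

```python
import torch

def combine_features(conv_out: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
    """Pool the convolutional front-end output and the transformer encodings over time,
    then concatenate them as the input to the downstream classifier.
    Both inputs are assumed to be [batch, time, dim]."""
    conv_pooled = conv_out.mean(dim=1)
    enc_pooled = enc_out.mean(dim=1)
    return torch.cat([conv_pooled, enc_pooled], dim=-1)
```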
The validation set: Session 4 of the IEMOCAP and MSP-Improv datasets, the validation set of the MSP-Podcast dataset, and speakers 1059-1073 of the CREMA-D dataset.
The evaluation set: Session 5 of the IEMOCAP and MSP-Improv datasets, the test set of the MSP-Podcast dataset, and speakers 1074-1091 of the CREMA-D dataset.
All remaining data are used for training.
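In summary (the dictionary below is only a descriptive restatement of the split, not configuration the repo uses):

```python
splits = {
    "validation": {
        "iemocap": "Session 4",
        "msp-improv": "Session 4",
        "msp-podcast": "official validation set",
        "crema_d": "speakers 1059-1073",
    },
    "test": {
        "iemocap": "Session 5",
        "msp-improv": "Session 5",
        "msp-podcast": "official test set",
        "crema_d": "speakers 1074-1091",
    },
    # all remaining data -> training
}
```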
| Pre-trained Model | Test UAR (%) without PEFT | Test UAR (%) with LoRA | PEFT Model Name |
|---|---|---|---|
| Whisper Tiny | 62.26 | 63.48 | whisper-tiny-lora-16-conv |
| Whisper Base | 64.39 | 64.92 | whisper-base-lora-16-conv |
| Whisper Small | 65.77 | 66.01 | whisper-small-lora-16-conv |
| WavLM Base+ | 63.06 | 66.11 | wavlm-plus-lora-16-conv |
| WavLM Large | 68.54 | 68.66 | wavlm-large-lora-16-conv |
To begin with, please clone this repo:

```bash
git clone git@github.com:usc-sail/peft-ser.git
```

To install the conda environment:

```bash
cd peft-ser
conda env create -f peft-ser.yml
conda activate peft-ser
```
Please specify the dataset paths and your working directory in config/config.yml:

```yaml
data_dir:
  crema_d: CREMA_D_PATH
  iemocap: IEMOCAP_PATH
  msp-improv: MSP-IMPROV_PATH
  msp-podcast: MSP-PODCAST_PATH
project_dir: OUTPUT_PATH
```
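These paths are consumed by the preprocessing and training scripts below; as a reference, a minimal sketch of loading the file with PyYAML (illustrative only, the repo's actual loading code may differ):

```python
import yaml

# Read the dataset paths and output directory from the config file
with open("config/config.yml") as f:
    config = yaml.safe_load(f)

iemocap_path = config["data_dir"]["iemocap"]
output_path = config["project_dir"]
```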
For most of the datasets, users first need to generate the train/dev/test splits with the provided scripts. Take the IEMOCAP data as an example:

```bash
cd train_split_gen
python3 iemocap.py
```
For most of the datasets, users can generate the preprocessed audio files with the provided script. The preprocessing resamples the audio to 16 kHz and converts it to a mono channel. Take the IEMOCAP data as an example:

```bash
cd preprocess_audio
python3 preprocess_audio.py --dataset iemocap
# dataset: iemocap, msp-improv, msp-podcast, crema_d
```
The script will generate the following folder under your working directory:

```
OUTPUT_PATH/audio/iemocap
```
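For reference, the 16 kHz mono conversion can be reproduced with torchaudio as sketched below (illustrative only, not the repo's preprocessing code; the file paths are placeholders):

```python
import torchaudio
import torchaudio.transforms as T

# Load an audio file, downmix to mono, and resample to 16 kHz
waveform, sample_rate = torchaudio.load("input.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # multi-channel -> mono
waveform_16k = T.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
torchaudio.save("output_16k.wav", waveform_16k, 16000)
```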
To fine-tune the downstream model with a pretrained backbone, use the following:

```bash
cd experiment
CUDA_VISIBLE_DEVICES=0, taskset -c 1-60 python3 finetune_emotion.py --pretrain_model wavlm_plus --dataset crema_d --learning_rate 0.0005 --num_epochs 30 --finetune_method finetune
```
To train the downstream model with adapters on a pretrained backbone, use the following:

```bash
cd experiment
CUDA_VISIBLE_DEVICES=0, taskset -c 1-60 python3 finetune_emotion.py --pretrain_model wavlm_plus --dataset crema_d --learning_rate 0.0005 --num_epochs 30 --finetune_method adapter --adapter_hidden_dim 128
```
To train the downstream model with embedding prompts on a pretrained backbone, use the following:

```bash
cd experiment
CUDA_VISIBLE_DEVICES=0, taskset -c 1-60 python3 finetune_emotion.py --pretrain_model wavlm_plus --dataset crema_d --learning_rate 0.0005 --num_epochs 30 --finetune_method embedding_prompt --embedding_prompt_dim 5
```
To train the downstream model with LoRA on a pretrained backbone, use the following:

```bash
cd experiment
CUDA_VISIBLE_DEVICES=0, taskset -c 1-60 python3 finetune_emotion.py --pretrain_model wavlm_plus --dataset crema_d --learning_rate 0.0005 --num_epochs 30 --finetune_method lora --lora_rank 16
```
The output will be saved under OUTPUT_PATH/result/. The reported metric is UAR (Unweighted Average Recall); the higher the metric, the better the performance.
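UAR is the recall averaged over emotion classes with equal class weights (macro recall). For reference, it can be computed from labels and predictions as sketched below (not the repo's evaluation code; the arrays are toy examples):

```python
from sklearn.metrics import recall_score

# UAR = unweighted (macro) average of per-class recall
y_true = [0, 1, 2, 3, 1, 0]  # ground-truth emotion indices (toy example)
y_pred = [0, 1, 2, 2, 1, 0]  # predicted emotion indices (toy example)
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar * 100:.2f}%")
```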