k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Finetune Whisper model on LibriSpeech #1571

Open marcoyang1998 opened 1 month ago

marcoyang1998 commented 1 month ago

This recipe finetunes a Whisper model on LibriSpeech following #1466.

marcoyang1998 commented 1 month ago

A comparison of decoding the Whisper model with different fbank feature storage. LilcomChunkyWriter stores compressed fbank features, which causes a slight mismatch when running inference with the Whisper model. NumpyHdf5Writer stores the uncompressed fbank features, but requires more storage.

In general, using the uncompressed features is slightly better than using the compressed features. The performance difference is minor, except for large-v2. The WERs are obtained using greedy search.

| model name | feature type | WER (test-clean/test-other) |
|---|---|---|
| small | Lilcom | 4.59/10.46 |
| small | hdf5 | 4.57/10.11 |
| small.en | Lilcom | 4.83/11.06 |
| small.en | hdf5 | 4.82/11.04 |
| medium | Lilcom | 4.02/7.53 |
| medium | hdf5 | 4.04/7.53 |
| medium.en | Lilcom | 3.72/7.69 |
| medium.en | hdf5 | 3.72/7.65 |
| large-v2 | Lilcom | 4.37/8.03 |
| large-v2 | hdf5 | 4.25/7.68 |
| large-v3 | Lilcom | 3.73/6.1 |
| large-v3 | hdf5 | 3.73/6.1 |
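
For reference, a minimal sketch of how the two storage backends would be selected when extracting features with lhotse. The manifest path, output paths, `num_jobs`, and the plain `Fbank` extractor are illustrative; the recipe's own prepare script may use a different extractor and layout.

```python
# Sketch: compute fbank features for a CutSet and store them either
# compressed (LilcomChunkyWriter) or uncompressed (NumpyHdf5Writer).
from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter, NumpyHdf5Writer

cuts = CutSet.from_file("librispeech_cuts_train-clean-100.jsonl.gz")  # illustrative path
extractor = Fbank(FbankConfig(num_mel_bins=80))  # stand-in; large-v3 expects 128 mel bins

# Compressed storage: smaller on disk, but lossy compression slightly
# perturbs the feature values, which is the mismatch discussed above.
cuts.compute_and_store_features(
    extractor=extractor,
    storage_path="data/fbank_lilcom",
    storage_type=LilcomChunkyWriter,
    num_jobs=4,
)

# Uncompressed storage: exact feature values, larger on disk.
cuts.compute_and_store_features(
    extractor=extractor,
    storage_path="data/fbank_hdf5",
    storage_type=NumpyHdf5Writer,
    num_jobs=4,
)
```
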
marcoyang1998 commented 1 month ago

Effect of freezing different modules

Num epochs = 10, with Lilcom-compressed features. Only fine-tuned on train-clean-100.

Finetune small.en, Adam optimizer, lr=1e-5

Without fine-tuning: 4.83/11.06 (greedy)

| Freeze modules | Num trainable params | test-clean/test-other |
|---|---|---|
| None | 241M | Greedy: 3.35/7.22, Beam search: 3.28/6.63 |
| encoder | 154M | Greedy: 3.67/7.81, Beam search: 3.51/7.17 |
| decoder | 87M | Greedy: 3.14/7.37, Beam search: 3.02/6.98 |

Finetune medium, Adam optimizer, lr=1e-5

Num epochs = 10, with Lilcom-compressed features. Without fine-tuning: 4.02/7.53 (greedy)

| Freeze modules | Num trainable params | test-clean/test-other |
|---|---|---|
| None | 762M | Greedy: 2.82/5.88, Beam search: 2.74/5.56 |
| encoder | 457M | Greedy: 3.2/6.41, Beam search: 3.02/6.0 |
| decoder | 356M | Greedy: 2.81/7.38, Beam search: 2.64/5.85 |
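
A minimal sketch of how a sub-module would be frozen before building the optimizer, assuming the openai-whisper package; the actual recipe's flags, helper names, and training loop differ.

```python
# Sketch: freeze the Whisper encoder or decoder so only the remaining
# parameters are fine-tuned; the count printed below corresponds to the
# "Num trainable params" column above.
import torch
import whisper  # openai-whisper package

model = whisper.load_model("small.en")

freeze = "decoder"  # one of None, "encoder", "decoder"
if freeze is not None:
    for p in getattr(model, freeze).parameters():
        p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable) / 1e6:.0f}M")

# Only the trainable parameters are handed to Adam (lr=1e-5 here).
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```

Passing only the trainable parameters to the optimizer also avoids keeping Adam state for frozen weights.
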
marcoyang1998 commented 1 month ago

Effect of different learning rates:

Model: small.en (without fine-tuning: 4.83/11.06)

| learning rate | test-clean/test-other |
|---|---|
| 1e-4 | 4.77/10.48 |
| 5e-5 | 3.8/8.12 |
| 1e-5 | 3.35/7.22 |
| 5e-6 | 3.24/7.01 |

Model: medium (without fine-tuning: 4.02/7.53)

| learning rate | test-clean/test-other |
|---|---|
| 5e-5 | 6.81/14.76 |
| 1e-5 | 2.82/5.88 |
| 5e-6 | 2.79/5.74 |
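
For completeness, a sketch of the only knob that changes between the rows above, assuming the same Adam setup as in the freezing experiments; the model construction, data loading, and training loop are omitted.

```python
# Sketch: the learning-rate sweep only swaps the lr passed to Adam;
# everything else (model, data, number of epochs) stays the same.
import torch

def make_optimizer(model: torch.nn.Module, lr: float) -> torch.optim.Optimizer:
    """Adam over the trainable parameters, as in the tables above."""
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(params, lr=lr)

# Learning rates compared above. The smaller values work noticeably better
# here, and 5e-5 already degrades the medium model below its un-fine-tuned
# baseline (6.81/14.76 vs. 4.02/7.53).
for lr in (1e-4, 5e-5, 1e-5, 5e-6):
    ...  # build the model, call make_optimizer(model, lr), fine-tune, decode
```
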