k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Finetune Whisper model on LibriSpeech #1571

Open marcoyang1998 opened 1 month ago

marcoyang1998 commented 1 month ago

This recipe finetunes a Whisper model on LibriSpeech following #1466.

marcoyang1998 commented 1 month ago

A comparison of decoding the Whisper model with different fbank feature storage. LilcomChunkyWriter stores compressed fbank features, which causes a slight mismatch when running inference with the Whisper model. NumpyHdf5Writer stores the uncompressed fbank features, but requires more storage.

In general, using the uncompressed features is slightly better than using the compressed features. The performance difference is minor, except for large-v2. The WERs are obtained using greedy search.

| model name | feature type | WER (test-clean/test-other) |
|---|---|---|
| small | Lilcom | 4.59/10.46 |
| small | hdf5 | 4.57/10.11 |
| small.en | Lilcom | 4.83/11.06 |
| small.en | hdf5 | 4.82/11.04 |
| medium | Lilcom | 4.02/7.53 |
| medium | hdf5 | 4.04/7.53 |
| medium.en | Lilcom | 3.72/7.69 |
| medium.en | hdf5 | 3.72/7.65 |
| large-v2 | Lilcom | 4.37/8.03 |
| large-v2 | hdf5 | 4.25/7.68 |
| large-v3 | Lilcom | 3.73/6.1 |
| large-v3 | hdf5 | 3.73/6.1 |
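
For reference, a minimal sketch of how the two storage backends would be selected when extracting features with lhotse. The manifest path, output paths, `num_jobs`, and the plain `Fbank` extractor are illustrative; the recipe's own prepare script may use a different extractor and layout.

```python
# Sketch: compute fbank features for a CutSet and store them either
# compressed (LilcomChunkyWriter) or uncompressed (NumpyHdf5Writer).
from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter, NumpyHdf5Writer

cuts = CutSet.from_file("librispeech_cuts_train-clean-100.jsonl.gz")  # illustrative path
extractor = Fbank(FbankConfig(num_mel_bins=80))  # stand-in; large-v3 expects 128 mel bins

# Compressed storage: smaller on disk, but lossy compression slightly
# perturbs the feature values, which is the mismatch discussed above.
cuts.compute_and_store_features(
    extractor=extractor,
    storage_path="data/fbank_lilcom",
    storage_type=LilcomChunkyWriter,
    num_jobs=4,
)

# Uncompressed storage: exact feature values, larger on disk.
cuts.compute_and_store_features(
    extractor=extractor,
    storage_path="data/fbank_hdf5",
    storage_type=NumpyHdf5Writer,
    num_jobs=4,
)
```
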
marcoyang1998 commented 1 month ago

Effect of freezing different modules

Num epochs = 10, with Lilcom-compressed features. Only fine-tuned on train-clean-100.

Finetune small.en, Adam optimizer, lr=1e-5

Without fine-tuning: 4.83/11.06 (greedy)

| Freeze modules | Num trainable params | test-clean/test-other |
|---|---|---|
| None | 241M | Greedy: 3.35/7.22, Beam search: 3.28/6.63 |
| encoder | 154M | Greedy: 3.67/7.81, Beam search: 3.51/7.17 |
| decoder | 87M | Greedy: 3.14/7.37, Beam search: 3.02/6.98 |

Finetune medium, Adam optimizer, lr=1e-5

Num epochs = 10, with Lilcom-compressed features. Without fine-tuning: 4.02/7.53 (greedy)

| Freeze modules | Num trainable params | test-clean/test-other |
|---|---|---|
| None | 762M | Greedy: 2.82/5.88, Beam search: 2.74/5.56 |
| encoder | 457M | Greedy: 3.2/6.41, Beam search: 3.02/6.0 |
| decoder | 356M | Greedy: 2.81/7.38, Beam search: 2.64/5.85 |
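
A minimal sketch of how a sub-module would be frozen before building the optimizer, assuming the openai-whisper package; the actual recipe's flags, helper names, and training loop differ.

```python
# Sketch: freeze the Whisper encoder or decoder so only the remaining
# parameters are fine-tuned; the count printed below corresponds to the
# "Num trainable params" column above.
import torch
import whisper  # openai-whisper package

model = whisper.load_model("small.en")

freeze = "decoder"  # one of None, "encoder", "decoder"
if freeze is not None:
    for p in getattr(model, freeze).parameters():
        p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable) / 1e6:.0f}M")

# Only the trainable parameters are handed to Adam (lr=1e-5 here).
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```

Passing only the trainable parameters to the optimizer also avoids keeping Adam state for frozen weights.
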
marcoyang1998 commented 1 month ago

Effect of different learning rates:

Model: small.en (without fine-tuning: 4.83/11.06)

| learning rate | test-clean/test-other |
|---|---|
| 1e-4 | 4.77/10.48 |
| 5e-5 | 3.8/8.12 |
| 1e-5 | 3.35/7.22 |
| 5e-6 | 3.24/7.01 |

Model: medium (without fine-tuning: 4.02/7.53)

| learning rate | test-clean/test-other |
|---|---|
| 5e-5 | 6.81/14.76 |
| 1e-5 | 2.82/5.88 |
| 5e-6 | 2.79/5.74 |
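
For completeness, a sketch of the only knob that changes between the rows above, assuming the same Adam setup as in the freezing experiments; the model construction, data loading, and training loop are omitted.

```python
# Sketch: the learning-rate sweep only swaps the lr passed to Adam;
# everything else (model, data, number of epochs) stays the same.
import torch

def make_optimizer(model: torch.nn.Module, lr: float) -> torch.optim.Optimizer:
    """Adam over the trainable parameters, as in the tables above."""
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(params, lr=lr)

# Learning rates compared above. The smaller values work noticeably better
# here, and 5e-5 already degrades the medium model below its un-fine-tuned
# baseline (6.81/14.76 vs. 4.02/7.53).
for lr in (1e-4, 5e-5, 1e-5, 5e-6):
    ...  # build the model, call make_optimizer(model, lr), fine-tune, decode
```
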