marcoyang1998 opened 1 month ago
A comparison of decoding the Whisper model with different fbank feature storage. `LilcomChunkyWriter`
stores lossily compressed fbank features, causing a slight mismatch when running inference with the Whisper model. `NumpyHdf5Writer`
stores the uncompressed fbank features, but requires more storage.
In general, using the uncompressed features is slightly better than using the compressed ones. The performance difference is minor, except for large-v2. All WERs are obtained with greedy search and reported as test-clean/test-other.
Model | Feature type | WER (test-clean/test-other) |
---|---|---|
small | Lilcom | 4.59/10.46 |
small | hdf5 | 4.57/10.11 |
small.en | Lilcom | 4.83/11.06 |
small.en | hdf5 | 4.82/11.04 |
medium | Lilcom | 4.02/7.53 |
medium | hdf5 | 4.04/7.53 |
medium.en | Lilcom | 3.72/7.69 |
medium.en | hdf5 | 3.72/7.65 |
large-v2 | Lilcom | 4.37/8.03 |
large-v2 | hdf5 | 4.25/7.68 |
large-v3 | Lilcom | 3.73/6.1 |
large-v3 | hdf5 | 3.73/6.1 |
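To see why compressed storage can shift the WER at all, here is a minimal pure-Python sketch of a lossy round-trip: it quantizes a toy fbank-like vector to fixed precision and measures the reconstruction error. This is only a stand-in for Lilcom's actual variable-precision scheme, which is not reproduced here.

```python
import math

def lossy_roundtrip(features, bits=8):
    """Quantize each value to `bits` fractional bits, then restore it.
    Illustrative stand-in for a lossy codec such as Lilcom."""
    scale = 2 ** bits
    return [round(x * scale) / scale for x in features]

fbank = [5.0 * math.sin(0.1 * i) for i in range(80)]  # toy 80-dim frame
restored = lossy_roundtrip(fbank)

# The error is tiny but nonzero, so the model sees slightly perturbed input.
max_err = max(abs(a - b) for a, b in zip(fbank, restored))
```

The table above suggests most Whisper checkpoints are robust to this small perturbation; large-v2 is the notable exception.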
Fine-tuning small.en: num epochs = 10, with Lilcom-compressed features, fine-tuned on train-clean-100 only.
Without fine-tuning: 4.83/11.06 (greedy).
Frozen modules | Num trainable params | WER (test-clean/test-other) |
---|---|---|
None | 241M | Greedy: 3.35/7.22, Beam search: 3.28/6.63 |
encoder | 154M | Greedy: 3.67/7.81, Beam search: 3.51/7.17 |
decoder | 87M | Greedy: 3.14/7.37, Beam search: 3.02/6.98 |
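The trainable-parameter counts above follow directly from which module is frozen. A minimal pure-Python sketch, using per-module sizes inferred from the small.en rows (encoder ~87M, decoder ~154M); a real recipe would instead call `p.requires_grad_(False)` on the torch parameters of the frozen module.

```python
# Hypothetical per-module parameter counts, inferred from the table above.
model = {
    "encoder": {"trainable": True, "numel": 87_000_000},
    "decoder": {"trainable": True, "numel": 154_000_000},
}

def freeze(model, name):
    """Mark every parameter of the named module as frozen."""
    model[name]["trainable"] = False

def num_trainable(model):
    """Total number of parameters that would still receive gradients."""
    return sum(m["numel"] for m in model.values() if m["trainable"])

freeze(model, "encoder")
remaining = num_trainable(model)  # matches the "encoder" row: 154M trainable
```

Note that freezing the decoder (keeping only ~87M encoder parameters trainable) still gives the best greedy test-clean WER in the table, despite training far fewer parameters.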
Fine-tuning medium: num epochs = 10, with Lilcom-compressed features.
Without fine-tuning: 4.02/7.53 (greedy).
Frozen modules | Num trainable params | WER (test-clean/test-other) |
---|---|---|
None | 762M | Greedy: 2.82/5.88, Beam search: 2.74/5.56 |
encoder | 457M | Greedy: 3.2/6.41, Beam search: 3.02/6.0 |
decoder | 356M | Greedy: 2.81/7.38, Beam search: 2.64/5.85 |
Learning-rate sweep for small.en (without fine-tuning: 4.83/11.06):

Learning rate | WER (test-clean/test-other) |
---|---|
1e-4 | 4.77/10.48 |
5e-5 | 3.8/8.12 |
1e-5 | 3.35/7.22 |
5e-6 | 3.24/7.01 |
Learning-rate sweep for medium (without fine-tuning: 4.02/7.53):

Learning rate | WER (test-clean/test-other) |
---|---|
5e-5 | 6.81/14.76 |
1e-5 | 2.82/5.88 |
5e-6 | 2.79/5.74 |
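For context on the size of these gains, a quick arithmetic sketch of relative WER reduction against the no-fine-tuning baseline, using numbers from the tables above (medium, test-clean, lr = 5e-6):

```python
def relative_wer_reduction(baseline, finetuned):
    """Percent relative improvement of `finetuned` over `baseline`."""
    return 100.0 * (baseline - finetuned) / baseline

# medium on test-clean: 4.02 without fine-tuning -> 2.79 at lr=5e-6,
# i.e. roughly a 30% relative WER reduction.
gain = relative_wer_reduction(4.02, 2.79)
```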
This recipe fine-tunes a Whisper model on LibriSpeech, following #1466.