navi0105 / LyricAlignment

Source code of paper "Adapting pretrained speech model for Mandarin lyrics transcription and alignment"

Can a larger Whisper model improve performance? #3

Closed SuperiorDtj closed 4 days ago

SuperiorDtj commented 1 month ago

Just as the title asks.

navi0105 commented 1 month ago

Due to the lack of computing resources, we have only tried fine-tuning the Whisper Large-v2 decoder (with the encoder frozen). Apologies for not keeping the exact experimental data, but my impression is that lyrics transcription was better than with Whisper Medium, while lyrics alignment accuracy was worse. Although we have not tested it, I think using a PEFT method such as LoRA to fine-tune a larger Whisper model for the lyrics transcription / alignment task under limited computing resources is a viable option, FYI.
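For example, here is a minimal sketch of both options using HuggingFace `transformers` and `peft` (this is not the code from this repository; the checkpoint name, target modules, and hyperparameters are illustrative assumptions):

```python
import torch
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a larger Whisper checkpoint (illustrative choice).
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float16
)

# Option 1: fine-tune only the decoder by freezing the encoder,
# as in the Large-v2 experiment described above.
for param in model.model.encoder.parameters():
    param.requires_grad = False

# Option 2: parameter-efficient fine-tuning with LoRA adapters on the
# attention projections; only the small adapter matrices are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # common choice for Whisper attention layers
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically only ~1% of parameters are trainable
```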

SuperiorDtj commented 1 month ago

Thank you for your suggestion! I noticed that the released checkpoint is based on the medium version of Whisper. Would you consider open-sourcing a fine-tuned version based on a larger Whisper model in the future?

navi0105 commented 1 month ago

Since I'm about to graduate from grad school, I can't guarantee that I'll still have enough computing resources to conduct more experiments in the future, so there are no plans to release other fine-tuned Whisper checkpoints, sorry.

SuperiorDtj commented 4 days ago

> Since I'm about to graduate from grad school, I can't guarantee that I'll still have enough computing resources to conduct more experiments in the future, so there are no plans to release other fine-tuned Whisper checkpoints, sorry.

Thank you for your work on the fine-tuned Whisper checkpoints! I completely understand your situation, and I appreciate the effort you've put in. Best of luck with your graduation and future endeavors!