jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

use facebook's pretrained encodec model #144

Closed · thivux closed 4 months ago

thivux commented 4 months ago

I am experimenting with VoiceCraft on another language and I don't want to retrain a custom EnCodec model for that language. Has anyone tried training VoiceCraft with the 8-codebook EnCodec checkpoint from Facebook? How do the results compare to using 4 codebooks?

jasonppy commented 4 months ago

From other people's feedback, you don't need to retrain EnCodec on your target language; just fine-tuning the LLM is enough.
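For reference, here is a minimal sketch of extracting codebook tokens with Meta's pretrained 24 kHz EnCodec via the `encodec` pip package (not VoiceCraft's own preprocessing pipeline; the input path and bandwidth choice are placeholders). At 24 kHz, a 6.0 kbps target bandwidth yields 8 codebooks, while 3.0 kbps yields 4:

```python
# Hedged sketch: tokenize audio with Meta's pretrained 24 kHz EnCodec
# from the `encodec` pip package. "sample.wav" is a placeholder;
# VoiceCraft's actual data pipeline may differ.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks (3.0 kbps -> 4)
model.eval()

wav, sr = torchaudio.load("sample.wav")  # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
# Concatenate codes across frames: shape [batch, n_codebooks, n_timesteps]
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)
print(codes.shape)  # e.g. torch.Size([1, 8, 750]) for a 10 s clip at 75 Hz
```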

thivux commented 4 months ago

Thanks @jasonppy for the info. If I have a large dataset (~5k hours of audio) in my target language (Vietnamese), should I go with fine-tuning the LLM or train a new one from scratch?

jasonppy commented 4 months ago

You could try both



thivux commented 4 months ago

Thank you. Closing the issue.

justinatbahasa commented 3 months ago

Hi @thivux, curious about your results: is only fine-tuning the pretrained model on the new language enough, or is training a new model from scratch preferable? Thanks.

thivux commented 3 months ago

@justinatbahasa Hi, in my experience, fine-tuning the pretrained checkpoint with a smaller amount of data (400 hours of Vietnamese + 100 hours of English) gave even better results than training from scratch (5k hours of Vietnamese).
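For anyone replicating this, a rough sketch of initializing fine-tuning from the pretrained weights, assuming the checkpoint layout used in the repo's inference examples (a dict with `"config"` and `"model"` keys); the optimizer settings and the data loop are placeholders, not the repo's actual trainer:

```python
# Hedged sketch: start fine-tuning from a pretrained VoiceCraft checkpoint.
# Assumes the {"config", "model"} checkpoint format seen in the repo's
# inference examples; the LR and training loop below are placeholders.
import torch
from models import voicecraft  # module from the VoiceCraft repo

ckpt = torch.load("giga830M.pth", map_location="cpu")
model = voicecraft.VoiceCraft(ckpt["config"])
model.load_state_dict(ckpt["model"])  # initialize from pretrained weights
model.train().cuda()

# Fine-tuning typically uses a much smaller LR than training from scratch.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# for batch in new_language_dataloader:  # placeholder: your fine-tuning data
#     ...
```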

justinatbahasa commented 3 months ago

cool! thanks

Magauiya commented 2 months ago

Hi @thivux @justinatbahasa! Do you have any clue why training from scratch gives worse results than fine-tuning? Did you follow the same training scheme as the original paper when you trained from scratch (4 GPUs, LR=0.01, and no "is nan, therefore skip this batch" issues)?
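For readers unfamiliar with that message: it refers to a guard that discards a batch when the loss turns non-finite, rather than letting NaNs propagate into the weights. A generic sketch of the pattern (not the repo's exact trainer code; `model`, `optimizer`, and `dataloader` are assumed to exist):

```python
# Generic "skip batch on NaN loss" guard in a training loop.
import torch

for batch in dataloader:
    loss = model(**batch)  # placeholder forward pass returning a scalar loss
    if not torch.isfinite(loss):
        # Unstable step (e.g. LR too high): drop this batch entirely.
        print("loss is nan, therefore skip this batch")
        optimizer.zero_grad(set_to_none=True)
        continue
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # optional stabilizer
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```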