jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Question: VRAM requirements for training, finetuning, and inference? #76

ProjectProgramAMark opened this issue 3 months ago (status: Open)

ProjectProgramAMark commented 3 months ago

Do we have a general sense of this? Has LoRA/QLoRA fine-tuning been attempted on this model, and if so, is there any guidance?

jasonppy commented 3 months ago

Thanks

Inference, for the default example in the demo (the one in inference_tts.ipynb):

- 830M model: around 22GB with kvcache on (i.e. kvcache=1), around 12GB with kvcache off
- 330M model: around 15GB with kvcache on, around 5GB with kvcache off
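If you want to verify these numbers on your own hardware, PyTorch's built-in peak-memory counters are enough. A minimal sketch, where `run_tts` is a placeholder callable standing in for the actual generation call from inference_tts.ipynb:

```python
import torch

def measure_peak_vram(run_tts):
    """Report the peak GPU memory used by one inference call.

    run_tts is a placeholder for the generation call from
    inference_tts.ipynb (e.g. a lambda wrapping the model call).
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_tts()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak VRAM: {peak_gb:.1f} GB")
```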

Training: 48GB

LoRA has not been used so far.

ProjectProgramAMark commented 3 months ago

Awesome, thank you for the quick response! I'm hoping to see some LoRA/QLoRA action on this soon. I think something like being able to switch out adapter weights on a base model and having different voices come out of it is something that would be so cool to see. I will try and push that through myself if I have the time (if you have any recommendations on which layers to apply it to that would be great!), but regardless I think this is awesome and I'm excited to start playing around with it