skirdey / voicerestore

VoiceRestore: Flow-Matching Transformers for Universal Speech Restoration
MIT License
88 stars 9 forks

Low inference speed #6

Closed AnticPan closed 1 month ago

AnticPan commented 1 month ago

Hi there,

I've noticed that processing a 2-minute audio file takes about 3.5 minutes with the current default settings in inference_long.py. Are there any plans to optimize the speed? For reference, the F5-TTS project (https://github.com/SWivid/F5-TTS), which also uses flow matching, achieves an inference RTF (real-time factor) of 0.15.

Thank you for your hard work, and I appreciate any insights you can provide!
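
To put the two figures above on the same scale, here is a minimal sketch of the real-time factor calculation, assuming the usual definition (processing time divided by audio duration). The helper name `real_time_factor` is made up for illustration and is not part of VoiceRestore or F5-TTS.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures reported in this issue: 2 minutes of audio in about 3.5 minutes.
reported_rtf = real_time_factor(3.5 * 60, 2 * 60)

# RTF quoted for F5-TTS in this issue.
f5_tts_rtf = 0.15

# Speedup that would be needed to reach the quoted RTF.
required_speedup = reported_rtf / f5_tts_rtf

print(f"RTF: {reported_rtf:.2f}")
print(f"speedup needed: {required_speedup:.1f}x")
```

An RTF above 1 means processing is slower than real time. With the numbers in this issue the RTF is about 1.75, so matching the quoted 0.15 would take a speedup of roughly 11.7x.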

skirdey commented 1 month ago

Hi! Thank you for reporting the slowness. Most of the inference latency comes from the BigVGAN vocoder. If you are on a Linux machine with an NVIDIA GPU, you can try setting BigVGAN to use its CUDA kernels. Currently this is set to False to maintain compatibility with other hardware. I am planning to release a new, smaller and more capable model soon, so hopefully that will reduce latency as well.
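
A minimal sketch of the suggestion above: decide at runtime whether the CUDA kernels can be enabled, so the default stays False elsewhere. The helper `should_use_cuda_kernel` is hypothetical, and the commented-out call assumes BigVGAN's `from_pretrained` accepts a `use_cuda_kernel` keyword as described in its README. Check both against the version you have installed.

```python
import platform


def cuda_available() -> bool:
    """Return True if PyTorch is installed and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def should_use_cuda_kernel(system: str, has_cuda: bool) -> bool:
    """Hypothetical helper: enable the kernels only on Linux with CUDA."""
    return system == "Linux" and has_cuda


use_cuda_kernel = should_use_cuda_kernel(platform.system(), cuda_available())
print(f"use_cuda_kernel={use_cuda_kernel}")

# Assumed usage (not run here; the checkpoint name is a placeholder):
# import bigvgan
# vocoder = bigvgan.BigVGAN.from_pretrained(
#     "<bigvgan-checkpoint>",
#     use_cuda_kernel=use_cuda_kernel,
# )
```

BigVGAN's README reportedly builds the kernel on first use, which requires `nvcc` and `ninja`, so the first run may be slower than later ones.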