BryceBarbara opened 1 month ago
I am also highly interested in this. Since Piper models use ONNX and transformers.js provides GPU inference for ONNX models, I feel like that might be another way to accomplish this with a higher-level library.
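For what it's worth, the lower-level route would be onnxruntime-web, which lets you request the WebGPU execution provider and fall back to WASM (CPU) when the browser doesn't expose WebGPU. A minimal sketch, where the model URL is a placeholder and the fallback logic is my assumption about how you'd want it to behave:

```javascript
// Sketch: loading a Piper ONNX model in the browser with onnxruntime-web,
// preferring the WebGPU execution provider and falling back to WASM (CPU).

// Pick execution providers based on what the browser exposes.
function pickExecutionProviders(hasWebGPU) {
  // WebGPU first when available; 'wasm' is the CPU fallback.
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// `ort` would be the onnxruntime-web module; `modelUrl` points at a
// Piper voice model (placeholder).
async function loadPiperSession(ort, modelUrl) {
  // WebGPU feature detection: the API lives on navigator.gpu.
  const providers = pickExecutionProviders('gpu' in navigator);
  return ort.InferenceSession.create(modelUrl, {
    executionProviders: providers,
  });
}
```

Whether the WebGPU provider actually supports all of Piper's operators is exactly the open question raised elsewhere in this thread, so the fallback matters.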
I think there might be some others also interested in this from other projects: https://github.com/diffusionstudio/vits-web/issues/3
If combined with the ability to export audio as MP3, I think it would be amazing. It would allow audiobooks to be created super easily, with great UX, right in the browser. https://github.com/ken107/read-aloud/issues/7 https://github.com/ken107/read-aloud/issues/159
If anyone has ideas on this, please reach out. I would love to hack on this but am unsure where to start.
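On the MP3 export idea: an in-browser encoder such as lamejs expects 16-bit integer PCM, while ONNX inference typically yields Float32 samples, so there's a conversion step in between. A sketch of that step (the lamejs usage and the 22050 Hz sample rate are assumptions, shown only in comments):

```javascript
// Sketch: converting Float32 PCM (as produced by TTS inference) to the
// 16-bit integer PCM that browser MP3 encoders such as lamejs expect.
function floatTo16BitPCM(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// Hypothetical usage with lamejs (mono, 22050 Hz, 128 kbps — all assumptions):
//   const encoder = new lamejs.Mp3Encoder(1, 22050, 128);
//   const mp3Parts = [encoder.encodeBuffer(floatTo16BitPCM(samples)), encoder.flush()];
//   const blob = new Blob(mp3Parts, { type: 'audio/mp3' });
```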
I can't recall where, but I saw multiple discussions about how Piper inferencing on GPU doesn't offer much performance improvement over CPU. Moreover, GPU support in Piper is not yet mature and still has issues. When I was R&Ding for https://github.com/ken107/piper-browser-extension, I tried GPU inferencing on my RTX 3060 and ran into problems with unsupported operators. Not being a machine-learning expert, I couldn't resolve the issue. Anyway, just adding my experience.
I love the new Piper feature that allows for some better sounding voices to read text!
I've run into the issue that it can take a while before you hear the first bit of audio. I assume this is because the JavaScript inference engine does everything on the CPU. On my work laptop, my CPU is gobbled up by various developer apps (heck, I've got something like 13 instances of Chrome running thanks to everyone using Electron).
I was wondering, would the time to first sound (let's call this TTFS) be lower if we used the GPU?
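Whatever backend is used, TTFS is easy to measure directly, which would let us compare CPU vs. GPU empirically. A sketch, where `synthesize` is a hypothetical async generator that yields audio chunks as they are produced:

```javascript
// Sketch: measuring time-to-first-sound (TTFS) for a streaming synthesizer.
// `synthesize(text)` is a hypothetical async generator yielding audio chunks.
async function measureTTFS(synthesize, text) {
  const start = performance.now();
  for await (const chunk of synthesize(text)) {
    // The first yielded chunk is the earliest moment playback could begin.
    return { ttfsMs: performance.now() - start, firstChunk: chunk };
  }
  // The generator produced no audio at all.
  return { ttfsMs: Infinity, firstChunk: null };
}
```

Running this once with the WASM backend and once with a GPU backend on the same text would answer the question with numbers rather than guesses.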
From a quick search, it appears there are a few options for doing that: