rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

Piper running on iOS as a system voice! Anyone with iOS experience able to troubleshoot issues or improve this code? #521

Open S-Ali-Zaidi opened 3 months ago

S-Ali-Zaidi commented 3 months ago

I’m not sure if anyone has noticed, but there is a Swift-native implementation of Piper that allows it to run on iOS -- and lets Piper models be used as the system voice on iOS and macOS (for text-to-speech features such as Spoken Content and Live Speech). This also makes Piper voices natively accessible from other iOS apps (such as book readers!) that tap into the iOS system voices.

This is thanks to the AVSpeechSynthesisProviderVoice and AVSpeechSynthesisProviderAudioUnit classes in Swift (part of AVFoundation, available since iOS 16). Some documentation on it here:
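For anyone unfamiliar with the provider API, here is a rough sketch of what such an extension's audio unit can look like. The PiperEngine type and the voice name/identifier are hypothetical stand-ins; the overridden members are the actual AVSpeechSynthesisProviderAudioUnit API:

```swift
import AVFoundation

// Hypothetical stand-in for the real Piper/ONNX inference wrapper.
struct PiperEngine {
    func synthesize(ssml: String) -> [Float] { [] } // returns PCM samples
}

@available(iOS 16.0, *)
class PiperAudioUnit: AVSpeechSynthesisProviderAudioUnit {
    private let engine = PiperEngine()
    private var pendingSamples: [Float] = []

    // Advertise the Piper model as a voice the system can offer to apps.
    override var speechVoices: [AVSpeechSynthesisProviderVoice] {
        get {
            [AVSpeechSynthesisProviderVoice(
                name: "Cori (Piper)",                       // hypothetical
                identifier: "com.example.piper.en-GB-cori", // hypothetical
                primaryLanguages: ["en-GB"],
                supportedLanguages: ["en-GB"])]
        }
        set { }
    }

    // The system hands us SSML for each utterance; we run inference and
    // buffer the samples for the audio unit's render block to drain.
    override func synthesizeSpeechRequest(_ speechRequest: AVSpeechSynthesisProviderRequest) {
        pendingSamples = engine.synthesize(ssml: speechRequest.ssmlRepresentation)
    }

    override func cancelSpeechRequest() {
        pendingSamples.removeAll()
    }
}
```

The system calls synthesizeSpeechRequest once per utterance, and the buffered samples are then delivered through the audio unit's standard render block (omitted here).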

@IhorSchevchuk has wrapped this all together into Swift packages and an iOS app for Piper, piper-phonemize, ONNX Runtime, etc.

I’ve tested it and gotten it working -- however, I’m finding that high-quality models do not run on my iPhone 13, and iOS reverts to using the compact Siri voice.

With medium-quality models, my iPhone 13 handles short phrases, but longer phrases cause it to kick back to the compact Siri voice. See demos here using the en-GB_Cori-Medium voice:

https://github.com/rhasspy/piper/assets/122964093/7ac1f682-10f8-45fd-90f2-a55be4bfa813

https://github.com/rhasspy/piper/assets/122964093/3a61e513-4b79-4385-8c1f-bbf3b32f421f

I’m a bit confused by this behavior -- given that I’m able to run 1B-3B parameter LLMs on my phone without much issue. I’m equally confused because I’m able to use Piper medium-quality models just fine through a WASM interface on iOS, such as this one by K2-FSA.

Wondering if anyone has any insights or experience on this end -- or knows how to get Piper models running properly in native Swift apps! I love that I can get it working as a system voice -- I just need to get it working consistently.

S-Ali-Zaidi commented 3 months ago

Here’s another example of the issue -- this time using the En_GB_Jenny-Medium voice in the Speech Central app. You can see that it performs quite responsively until we get to a longer sentence in the book. I’ve opened an issue with the author of the repo, but I’m hoping someone else here has gotten Piper working smoothly on their iOS devices!

https://github.com/rhasspy/piper/assets/122964093/f53b598f-54e1-4077-ba41-00502eff639d

scalar27 commented 3 months ago

Wow. I'm interested to learn more.

vinylrichie commented 3 weeks ago

@S-Ali-Zaidi Did you ever get this sorted out?

W1Real commented 2 weeks ago

It's probably falling back because of latency; Apple may impose a deadline on generation time to keep TTS latency low. If so, it might be possible to chunk the sentence, stream the shorter chunks, and return them as they are generated (though perhaps with some unnatural pauses introduced) -- see the sketch below.
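A minimal sketch of that chunking idea, using Apple's NaturalLanguage sentence tokenizer; synthesizeChunk is a hypothetical hook into whatever Piper call actually produces audio:

```swift
import NaturalLanguage

// Split the input into sentences so each synthesis call stays short
// enough to finish before any system-imposed deadline, streaming each
// chunk as soon as it is ready.
func synthesizeInChunks(_ text: String, synthesizeChunk: (String) -> Void) {
    let tokenizer = NLTokenizer(unit: .sentence)
    tokenizer.string = text
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        synthesizeChunk(String(text[range])) // hypothetical Piper hook
        return true // keep enumerating sentences
    }
}
```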

lumpidu commented 2 weeks ago

The model needs too much memory for an Audio Unit Extension, which is the format a TTS program must take if it provides a system-wide TTS voice on iOS. Even the Piper XS voices need up to a few hundred megabytes of RAM at runtime, and only about 120 MB is allowed, e.g. on an iPhone 13 Pro Max. So if the system detects that the Audio Unit Extension needs more than the allowed amount of RAM, it kills the extension. This has nothing to do with latency.
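If the memory ceiling is indeed the culprit, one defensive option is to probe the available headroom before loading a model and fall back to a smaller one. A minimal sketch, assuming os_proc_available_memory() (from Apple's os framework, iOS 13+) is callable inside the extension; the 300 MB footprint figure below is illustrative, not a measured value:

```swift
import os

// Ask the kernel how much memory this process may still allocate
// before hitting its jetsam limit, and only load a model that fits.
func canLoadModel(ofFootprint bytes: Int) -> Bool {
    os_proc_available_memory() > bytes
}

if canLoadModel(ofFootprint: 300 * 1024 * 1024) { // assumed footprint
    // load the medium-quality Piper model
} else {
    // load a smaller model, or decline so the system picks another voice
}
```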