jame25 / Piper-Read

Piper Read is a lightweight Piper TTS GUI written in C#.

[Feature] Async .wav generation #3

Closed: jesherman closed this issue 3 months ago

jesherman commented 5 months ago

Right now it generates the entire text file into a single .wav. This can take 10-20 seconds on a high-end PC (13900k) before playback starts, and it means any re-conversion will also take a long time.

Alternatively, you could queue up multiple .wav files in 15-30 second chunks (or split at the end of each sentence / paragraph), so the audio files can be sequenced back to back and playback can start sooner.
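A minimal sketch of the chunking idea, written in Python for brevity rather than the app's C#. The function name `chunk_sentences` and the `max_words` limit are hypothetical; a word count stands in for audio duration, and the default of 60 words is only a rough guess at a 15-30 second chunk.

```python
import re

def chunk_sentences(text: str, max_words: int = 60) -> list[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk,
    because sentences are never split in the middle.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be converted to its own .wav and queued for playback in order.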

jesherman commented 5 months ago

Perhaps eventually even Windows API real-time output (this would eliminate any need to chunk WAVs, and would allow adjusting speed in real time, text highlighting, etc.): https://stackoverflow.com/questions/17452300/dumping-wave-audio-to-stdout-using-windows-api

jame25 commented 5 months ago

Just to update you on current development: I've decided to do a complete rebuild of the application, and I now have a working prototype of Piper Read that uses NAudio to stream audio output (from Piper) directly via the Windows API.

The application takes 1-2 seconds to start reading aloud on my development PC (10700k). I hope to upload this new version by the weekend, if not sooner.

jame25 commented 5 months ago

I'd appreciate some feedback on this pre-release version.

https://github.com/jame25/Piper-Read/releases/tag/v1.0.5

Regarding other new features, I will post in those specific threads shortly.

jesherman commented 5 months ago

Thanks for working on this. Will test around with it a bit more but a few immediate observations:

jame25 commented 5 months ago

FYI, I uploaded a new version with dictionary rules (ignore.dict/banned.dict) implemented just before you posted.

On your first point, that's certainly a curious and unintentional effect. I will attempt to resolve it. On your second point, I had been doing my testing with The Guardian articles, and the style there is different - mostly large paragraphs. This build is actually designed to generate audio for the first paragraph, and while playback is happening, prepare the subsequent paragraph(s). An empty line is interpreted as a new paragraph. Obviously this idea works better with The Guardian than NYT content.
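As a rough illustration of the "empty line means new paragraph" rule, here is a small Python sketch. It is not the app's actual C# code; `split_paragraphs` is a hypothetical name, and treating whitespace-only lines as empty is an assumption.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs, treating one or more empty lines as a break."""
    # Normalize Windows line endings so the pattern only has to handle '\n'.
    normalized = text.replace("\r\n", "\n")
    # A newline, optional whitespace (including further blank lines), then a newline.
    parts = re.split(r"\n\s*\n", normalized)
    return [p.strip() for p in parts if p.strip()]
```

With this rule, text made of many short paragraphs produces many small synthesis jobs, while long paragraphs give the synthesizer more time to prepare the next one during playback.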

Also, you previously mentioned that you use a specific voice. I do most of my testing with en_US-libritts_r-medium.onnx, which is where I get the 1-2 second start figure from.

jame25 commented 5 months ago

I uploaded a new version (105b) that offers a solution for the full stop being pronounced in some scenarios. The archive includes a replace.dict (with the 'dot' fix already applied) that can be edited to add words to be replaced, one rule per line. For example:

sky=tree
LHC=Large Hadron Collider
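To make the `word=replacement` format concrete, here is a hedged Python sketch of how such rules could be parsed and applied. The names `parse_rules` and `apply_rules` are hypothetical, and whole-word, case-sensitive matching is an assumption; the thread does not say how Piper Read matches entries.

```python
import re

def parse_rules(dict_text: str) -> dict[str, str]:
    """Parse 'word=replacement' lines; blank or malformed lines are skipped."""
    rules: dict[str, str] = {}
    for line in dict_text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        word, replacement = line.split("=", 1)
        word = word.strip()
        if word:
            rules[word] = replacement.strip()
    return rules

def apply_rules(text: str, rules: dict[str, str]) -> str:
    """Replace whole-word occurrences of each rule key (assumed behaviour)."""
    if not rules:
        return text
    # Longest keys first so a longer key wins over a shorter one it contains.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: rules[m.group(0)], text)
```

For example, `apply_rules("The LHC is big.", parse_rules("LHC=Large Hadron Collider"))` would hand "The Large Hadron Collider is big." to the speech engine.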

jesherman commented 5 months ago

Hey thanks, these are all great improvements.

Maybe it takes an extraterrestrial event to bring this shredded country together. For a phenomenon that traversed the country from the contentious southern border to the far reaches of New England, Monday’s eclipse attracted remarkably few conspiracy theories or accusations. From where I stood, in Buffalo, the major threat to the moment was a forecast of heavy clouds.

Bring on the ominous metaphors: We don’t have the foggiest idea where we’re going. This year, the eclipse passes America by. Here comes the rain again.

Perhaps I was too primed to seek meaning, having found unexpected significance in the last major eclipse to cross the country, back on Aug. 21, 2017. I needed it.

Wearied by the chaotic churn of Donald Trump’s presidency and desperate for a vacation, I told my family I wanted to see something in this country Trump couldn’t bash, alter, destroy or tarnish. I wanted mountains, rock structures, landscapes and vistas that would give me that sense of This Too Shall Pass, and the planet will still be around. We decided to spend 10 days in South Dakota, starting at Mount Rushmore and ending in the Badlands.

jame25 commented 5 months ago

I've finally managed to get piper.exe to run concurrently with audio playback of the previous paragraph. This has also reduced the pause between paragraphs, and between sentences divided by a new line. Speaking start time is still determined by the voice model used: en_US-libritts_r-medium takes two seconds and en_US-ljspeech-high takes around four seconds on my system.
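The overlap described here is a producer/consumer pipeline. The following Python sketch shows the general pattern only; it is not Piper Read's C#/NAudio implementation. `read_aloud`, `synthesize`, `play`, and `lookahead` are all hypothetical names, with `synthesize` standing in for a call to piper.exe and `play` for blocking audio playback.

```python
import queue
import threading
from typing import Any, Callable, Iterable

def read_aloud(
    paragraphs: Iterable[str],
    synthesize: Callable[[str], Any],
    play: Callable[[Any], None],
    lookahead: int = 1,
) -> None:
    """Play paragraphs in order while a worker thread synthesizes ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=lookahead)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        try:
            for paragraph in paragraphs:
                # Blocks when the buffer is full, so synthesis stays at most
                # `lookahead` finished paragraphs ahead of playback.
                buffer.put(synthesize(paragraph))
        finally:
            # Always signal the end, even if synthesize raised.
            buffer.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    while True:
        audio = buffer.get()
        if audio is done:
            break
        play(audio)  # the worker keeps synthesizing while this blocks
    worker.join()
```

Only the first paragraph incurs the full model start-up delay; later paragraphs play without a gap as long as synthesis finishes before the current playback ends.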

v1.0.5-final is here: https://github.com/jame25/Piper-Read/releases/tag/v1.0.5

jesherman commented 5 months ago

works great! Yes as observed the high quality models take a bit longer but with decent (20+ words) paragraphs it is very seamless.

This is a minor tweak and could be a separate issue: does it have to take a static copy of the text once it /starts/ conversion? If so, you should probably make the text box non-editable during conversion. I removed a paragraph and added a paragraph mid-conversion, and it didn't change the async audio generation.