[Feature] Async .wav generation

jesherman commented 5 months ago

Right now it converts the entire text file into a single .wav. This can take 10-20 seconds on a high-end PC (13900k) before playback starts, and it means any re-conversion will also take a long time.

Alternatively, you can queue up multiple .wav files in 15-30 second chunks (or after the end of parsing a sentence / paragraph) to allow it to sequence audio files back to back and have faster audio generation.
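A rough sketch of the chunking idea, not Piper Read's actual code: split the text into sentences, then group sentences into chunks under a word budget so each chunk can become its own short .wav. The function names, the naive regex splitter, and the `max_words` budget (a stand-in for "15-30 seconds" of audio) are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_words=60):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk,
    so sentences are never cut in the middle.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six! Seven eight nine? Ten."
    for chunk in chunk_sentences(split_sentences(sample), max_words=6):
        print(chunk)
```

Each chunk could then be handed to the synthesizer in order, so playback of the first chunk can begin while later ones are still being generated.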

jesherman commented 5 months ago

Perhaps eventually even real-time output via the Windows API (this would eliminate any need to chunk WAVs, and would allow adjusting speed in real time, text highlighting, etc.): https://stackoverflow.com/questions/17452300/dumping-wave-audio-to-stdout-using-windows-api

jame25 commented 5 months ago

Just to update you regarding current development; I've taken the decision to do a complete rebuild of the application and now have a working prototype of Piper Read that uses NAudio to directly stream audio output (from Piper) via the Windows API.

The application takes between 1 and 2 seconds to start reading aloud on my development PC (10700k). I hope to upload this new version by the weekend, if not sooner.

jame25 commented 5 months ago

I'd appreciate some feedback on this pre-release version.

https://github.com/jame25/Piper-Read/releases/tag/v1.0.5

Regarding other new features, I will post in those specific threads shortly.

jesherman commented 5 months ago

Thanks for working on this. Will test around with it a bit more but a few immediate observations:

1. Sentence generation is quite different between this build and the last build. On one hand, it's detecting commas/sentences a lot better, with proper pauses. However, it seems to keep saying "dot" at the end of a sentence that ends with a quote ("A sentence quoting someone."), as in most news articles from sites like the New York Times. Not sure if this is a change or just something unrelated.
2. It only seems to generate audio one sentence at a time. It finishes a sentence, then it spins up piper to start the next one, which on my machine (13900k) seems to take about 3-4 seconds for a 40-word paragraph. Perhaps there is a more elegant solution where you transcribe one or two sentences at a time, and then start transcribing the next ones after some arbitrary or calculated amount of time, to allow for fewer pauses between sentences? Otherwise it can be a bit jarring for complex paragraphs and can actually end up with more total time waiting vs. a single wav generated at the beginning.

jame25 commented 5 months ago

FYI, I uploaded a new version with dictionary rules (ignore.dict/banned.dict) implemented just before you posted.

On your first point, that's certainly a curious and unintentional effect. I will attempt to resolve it. On your second point, I had been doing my testing with The Guardian articles, and the style there is different - mostly large paragraphs. This build is actually designed to generate audio for the first paragraph, and while playback is happening, prepare the subsequent paragraph(s). An empty line is interpreted as a new paragraph. Obviously this idea works better with The Guardian than NYT content.
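For illustration only, the "empty line means new paragraph" rule could look something like the sketch below. This is not the application's own code; `split_paragraphs` is a hypothetical helper, and joining single line breaks inside a paragraph into spaces is an assumption.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs wherever one or more empty lines occur."""
    normalized = text.replace("\r\n", "\n")  # tolerate Windows line endings
    blocks = re.split(r"\n\s*\n", normalized)
    # Collapse internal whitespace so each paragraph is a single line of text.
    return [" ".join(block.split()) for block in blocks if block.strip()]


if __name__ == "__main__":
    article = "First line\nsame paragraph.\n\nSecond paragraph.\n\n\n\nThird."
    for number, paragraph in enumerate(split_paragraphs(article), start=1):
        print(number, paragraph)
```

With short, NYT-style paragraphs this produces many small units, which is why the gap between units matters more there than with long Guardian-style paragraphs.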

Also, you previously mentioned that you use a specific voice. I do most of my testing with en_US-libritts_r-medium.onnx, which is where I get the 1-2 second start figure from.

jame25 commented 5 months ago

I uploaded a new version (105b) that offers a solution for the full stop being pronounced in some scenarios. The archive includes a replace.dict (with the 'dot' fix already applied) that can be edited to add words to be replaced, one rule per line. For example:

sky=tree
LHC=Large Hadron Collider
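A minimal sketch of how rules in this `word=replacement` form could be parsed and applied. It only assumes the one-rule-per-line format shown above; the function names and the whole-word matching are illustrative choices, not Piper Read's actual behaviour.

```python
import re


def parse_replace_dict(lines):
    """Turn 'word=replacement' lines into a dict, skipping malformed lines."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        word, replacement = line.split("=", 1)  # split on the first '=' only
        word = word.strip()
        if word:
            rules[word] = replacement.strip()
    return rules


def apply_rules(text, rules):
    """Replace whole-word occurrences of each key (assumes keys are plain words)."""
    if not rules:
        return text
    # Longest keys first, so a longer key wins over a shorter one it starts with.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: rules[match.group(1)], text)


if __name__ == "__main__":
    rules = parse_replace_dict(["sky=tree", "LHC=Large Hadron Collider"])
    print(apply_rules("The sky above the LHC.", rules))
```

Matching on word boundaries keeps a rule such as `sky=tree` from rewriting part of a longer word like "skyline".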

jesherman commented 5 months ago

Hey thanks, these are all great improvements.

1. I am not observing the build run piper while also speaking the previous paragraph, if this is what you mean by "while playback is happening, prepare the subsequent paragraph(s)", at least with en_US-ljspeech-high. In my usage/testing it seems (based on Task Manager activity) to start running piper.exe only after it finishes reciting the previous paragraph. Below is a text sample where I observed this, with exact spacing as shown (copy-pasted from the article).
2. FYI, your beta build doesn't include banned.dict and ignore.dict, so it will catch errors until those are added back in.
3. replace.dict is a smart solution, thanks for adding it. JFYI, the current version doesn't support "=", not that it's a huge issue.
4. As another QoL, you may want to support commenting (# exclude this line), which you could use to include examples/context.

Maybe it takes an extraterrestrial event to bring this shredded country together. For a phenomenon that traversed the country from the contentious southern border to the far reaches of New England, Monday’s eclipse attracted remarkably few conspiracy theories or accusations. From where I stood, in Buffalo, the major threat to the moment was a forecast of heavy clouds.

Bring on the ominous metaphors: We don’t have the foggiest idea where we’re going. This year, the eclipse passes America by. Here comes the rain again.

Perhaps I was too primed to seek meaning, having found unexpected significance in the last major eclipse to cross the country, back on Aug. 21, 2017. I needed it.

Wearied by the chaotic churn of Donald Trump’s presidency and desperate for a vacation, I told my family I wanted to see something in this country Trump couldn’t bash, alter, destroy or tarnish. I wanted mountains, rock structures, landscapes and vistas that would give me that sense of This Too Shall Pass, and the planet will still be around. We decided to spend 10 days in South Dakota, starting at Mount Rushmore and ending in the Badlands.

jame25 commented 5 months ago

I've finally managed to get piper.exe to run concurrently during audio playback of the previous paragraph. This has also reduced the pause between paragraphs, and between sentences divided by a new line. Speaking start time is still determined by the voice model used, with en_US-libritts_r-medium taking two seconds and en_US-ljspeech-high taking around four seconds on my system.
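The general pattern described here (synthesize the next paragraph while the current one plays) can be sketched as a small producer/consumer pipeline. This is a generic illustration in Python, not the application's NAudio-based implementation; `run_pipeline`, `synthesize`, `play`, and `lookahead` are hypothetical names, and the two callables stand in for running piper and playing audio.

```python
import queue
import threading


def run_pipeline(paragraphs, synthesize, play, lookahead=1):
    """Play paragraphs in order while a worker thread synthesizes ahead."""
    ready = queue.Queue(maxsize=lookahead)  # bounds how far ahead we synthesize
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for paragraph in paragraphs:
                # put() blocks while the look-ahead buffer is full.
                ready.put(synthesize(paragraph))
        finally:
            ready.put(done)  # always unblock the consumer, even on error

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        audio = ready.get()
        if audio is done:
            break
        play(audio)  # the worker keeps synthesizing during playback
    worker.join()


if __name__ == "__main__":
    # Stand-ins: "synthesis" upper-cases the text, "playback" prints it.
    run_pipeline(["first", "second", "third"], str.upper, print)
```

The first paragraph still has to be synthesized before anything plays, which matches the model-dependent start time noted above; later paragraphs only cause a pause if synthesis takes longer than playback of the preceding one.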

v1.0.5-final is here: https://github.com/jame25/Piper-Read/releases/tag/v1.0.5

jesherman commented 5 months ago

Works great! Yes, as observed, the high-quality models take a bit longer, but with decent (20+ words) paragraphs it is very seamless.

This is a minor tweak and could be a separate issue: does it have to take a static copy of the text once it /starts/ conversion? If so, you should probably make the text box non-editable. I removed a paragraph and added a paragraph mid-conversion, and it didn't change the async audio generation.

jame25 / Piper-Read

[Feature] Async .wav generation #3