ken107 / read-aloud

An awesome browser extension that reads aloud webpage content with one click
https://readaloud.app
MIT License

Option to Output Audio to File #159

Open ericpelot opened 4 years ago

ericpelot commented 4 years ago

I know some version of this has been requested before, but it's been several years and some things have changed since then. With the inclusion of the option to provide our own API keys, we now have the ability to take advantage of the generous free-tiers of some services.

The enhancement I'm imagining would work much like the browser extension already does, but instead of outputting to the speaker, it would concatenate the clips into an audio file (.mp3, .ogg, .wav, etc.; whatever is easiest to code) to be downloaded/saved. It wouldn't need to be (and probably shouldn't be) significantly faster than the speaker output, keeping in mind the "characters per minute" limits of the different services. But just having the option to make the output portable in some way would be a terrific enhancement. I hope you'll consider it!
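
To illustrate the pacing concern, here is a minimal sketch of spacing out synthesis requests so that an export stays under a characters-per-minute quota. The function name `scheduleRequests` and the quota value in the usage note are hypothetical; real services publish their own limits and may count them differently.

```typescript
// Hypothetical helper: given the length of each text chunk and a
// characters-per-minute quota, return the earliest start time (in ms,
// relative to the first request) at which each request could be sent.
// This is a simple average-rate model, not any service's actual algorithm.
function scheduleRequests(chunkLengths: number[], charsPerMinute: number): number[] {
  if (charsPerMinute <= 0) {
    throw new Error("charsPerMinute must be positive");
  }
  const startTimesMs: number[] = [];
  let elapsedMs = 0;
  for (const length of chunkLengths) {
    startTimesMs.push(elapsedMs);
    // After sending `length` characters, wait long enough that the
    // average rate never exceeds the quota.
    elapsedMs += (length / charsPerMinute) * 60000;
  }
  return startTimesMs;
}
```

For example, with an assumed quota of 10,000 characters per minute, `scheduleRequests([5000, 5000, 2500], 10000)` spaces the three requests 30 seconds apart.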

Thank you for an already awesome extension!

ken107 commented 4 years ago

Indeed we could create the download option when using own API keys. Since we're not playing back the speech, we could synthesize the entire article into one audio file, and avoid concatenation.

But have you tried out this tool? https://ttstool.com/polly.html and https://ttstool.com/wavenet.html

ericpelot commented 4 years ago

Thank you for replying! I did check out the wavenet site and it does look like a cool site :) Two things about it still make me prefer the method above: 1.) There are no speed/pitch controls on the site, and 2.) For longer conversions, the site throws the error "{ "error": { "code": 400, "message": "5000 characters limit exceeded.", "status": "INVALID_ARGUMENT" } }" and does not intelligently chunk the text into digestible segments and then string the audio back together (whereas the extension does). If the site were updated to do those things, that could absolutely be a wonderful alternative solution! Thanks again!
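
As a rough illustration of the chunking described above, the sketch below splits text into pieces no longer than a given limit, preferring sentence boundaries. It is not the extension's actual code: the name `chunkText` is hypothetical, and `maxChars` counts JavaScript string length (UTF-16 code units), which may not match how a given service counts characters.

```typescript
// Hypothetical helper: split text into chunks of at most maxChars,
// breaking after sentence-ending punctuation where possible.
function chunkText(text: string, maxChars: number = 5000): string[] {
  if (maxChars <= 0) {
    throw new Error("maxChars must be positive");
  }
  // Each match is one sentence including its trailing punctuation and
  // whitespace, or the final fragment with no terminator.
  const sentences: string[] = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    let rest = sentence;
    // A single sentence longer than the limit is hard-split.
    // (Simplification: this could split a surrogate pair.)
    while (rest.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    // Start a new chunk if this sentence would overflow the current one.
    if (current.length + rest.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current += rest;
  }
  if (current) {
    chunks.push(current);
  }
  return chunks;
}
```

Joining the returned chunks reproduces the original text, so each chunk can be synthesized separately and the clips kept in order.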

ken107 commented 4 years ago

There's an issue with concatenation. It's not simply stringing the files together. We have to strip the header from each file, concatenate the payloads, and generate a new header for the entire large file. Usually we need to use special tools like FFmpeg or SoX for this purpose.
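
For the simplest case, the three steps above can be sketched without external tools. This sketch assumes every input is an uncompressed PCM WAV (RIFF) file with an identical `fmt ` chunk; it does not apply to MP3 or OGG, which have different framing. The names `parseWav` and `concatWav` are hypothetical.

```typescript
interface WavParts { fmt: Uint8Array; data: Uint8Array; }

function readAscii(view: DataView, offset: number, length: number): string {
  let s = "";
  for (let i = 0; i < length; i++) s += String.fromCharCode(view.getUint8(offset + i));
  return s;
}

function writeAscii(target: Uint8Array, offset: number, text: string): void {
  for (let i = 0; i < text.length; i++) target[offset + i] = text.charCodeAt(i);
}

// Step 1: strip the header by walking the RIFF chunks and keeping
// only the format description and the audio payload.
function parseWav(bytes: Uint8Array): WavParts {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (bytes.byteLength < 12 || readAscii(view, 0, 4) !== "RIFF" || readAscii(view, 8, 4) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let fmt: Uint8Array | null = null;
  let data: Uint8Array | null = null;
  let offset = 12;
  while (offset + 8 <= bytes.byteLength) {
    const id = readAscii(view, offset, 4);
    const size = view.getUint32(offset + 4, true);
    const start = offset + 8;
    const end = Math.min(start + size, bytes.byteLength);
    if (id === "fmt ") fmt = bytes.subarray(start, end);
    else if (id === "data") data = bytes.subarray(start, end);
    offset = start + size + (size % 2); // RIFF chunks are word-aligned
  }
  if (!fmt || !data) throw new Error("missing fmt or data chunk");
  return { fmt, data };
}

// Steps 2 and 3: concatenate the payloads and write one new header.
function concatWav(files: Uint8Array[]): Uint8Array {
  if (files.length === 0) throw new Error("no input files");
  const parsed = files.map(parseWav);
  const fmt = parsed[0].fmt;
  let dataSize = 0;
  for (const p of parsed) {
    if (p.fmt.byteLength !== fmt.byteLength) throw new Error("format mismatch");
    for (let i = 0; i < fmt.byteLength; i++) {
      if (p.fmt[i] !== fmt[i]) throw new Error("format mismatch");
    }
    dataSize += p.data.byteLength;
  }
  const out = new Uint8Array(12 + 8 + fmt.byteLength + 8 + dataSize);
  const view = new DataView(out.buffer);
  writeAscii(out, 0, "RIFF");
  view.setUint32(4, out.byteLength - 8, true); // RIFF size excludes the first 8 bytes
  writeAscii(out, 8, "WAVE");
  writeAscii(out, 12, "fmt ");
  view.setUint32(16, fmt.byteLength, true);
  out.set(fmt, 20);
  let offset = 20 + fmt.byteLength;
  writeAscii(out, offset, "data");
  view.setUint32(offset + 4, dataSize, true);
  offset += 8;
  for (const p of parsed) {
    out.set(p.data, offset);
    offset += p.data.byteLength;
  }
  return out;
}
```

Working on `Uint8Array` values keeps this usable in the browser, where synthesized audio arrives as binary data and never has to pass through a server. Compressed formats would still need a real audio-processing library.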

The Read Aloud extension isn't actually doing concatenation. It's just playing the audio files one by one.

There's another, smaller issue. Depending on the voice engine, the generated audio files may need to be padded with silence to create the necessary end-of-paragraph pause, before concatenation.
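
For raw PCM audio, the padding step is easy to sketch: silence is a run of samples at the zero level. The example below assumes uncompressed little-endian PCM (the payload of a WAV `data` chunk); the function names and the pause duration in the usage note are hypothetical, and compressed formats such as MP3 cannot be padded this way.

```typescript
// Hypothetical helper: build a block of PCM silence of the given duration.
function makeSilence(
  durationMs: number,
  sampleRate: number,
  channels: number = 1,
  bitsPerSample: number = 16
): Uint8Array {
  const bytesPerFrame = channels * (bitsPerSample / 8);
  // Round to whole frames so the result stays sample-aligned.
  const frames = Math.round((sampleRate * durationMs) / 1000);
  const silence = new Uint8Array(frames * bytesPerFrame); // zero-initialized
  if (bitsPerSample === 8) {
    // 8-bit WAV PCM is unsigned, so silence is the midpoint value 128.
    for (let i = 0; i < silence.length; i++) silence[i] = 128;
  }
  return silence;
}

// Hypothetical helper: append an end-of-paragraph pause to a PCM payload.
function padWithSilence(
  pcm: Uint8Array,
  durationMs: number,
  sampleRate: number,
  channels: number = 1,
  bitsPerSample: number = 16
): Uint8Array {
  const silence = makeSilence(durationMs, sampleRate, channels, bitsPerSample);
  const out = new Uint8Array(pcm.byteLength + silence.byteLength);
  out.set(pcm, 0);
  out.set(silence, pcm.byteLength);
  return out;
}
```

For instance, `padWithSilence(payload, 500, 24000)` would append half a second of silence to a 24 kHz, mono, 16-bit payload; the right pause length depends on the voice engine.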

It's ugly. I had this set up on the server for SiteSpeaker, using FFmpeg to add silence and for concatenation. The problem for TTSTool and Read Aloud is that the synthesized audio files go directly from Google Cloud/AWS to the browser. My server is never involved--for efficiency, and for the security of your API keys. Maybe there's an open-source JavaScript tool for audio processing; I will need to do some research.

For now, the 5000-character limit is actually your Google Cloud quota. You may want to go to the Google Cloud console and try to increase that quota. I think I got mine increased to 150,000.

ericpelot commented 4 years ago

That does sound kind of complicated. Thank you for researching it! I tried looking at the quotas in the console, but I think there are two different kinds of quotas: request and content. The content quota is 5000 characters per request. I was able to find the page to modify the request quotas but was not able to find the page that would let me adjust the content quota. I'll keep looking though, because that alone could be a big help!

ericpelot commented 4 years ago

It does look like the 5000 characters per request is a hard limit (will be happy to be wrong about this if anyone can point me in the right direction).

Regarding the concatenation difficulties, if it ends up looking too involved, I'd even be satisfied if I could pick a directory and have all the individual clips saved separately, and then concatenate them myself (or just play them in order). Being able to achieve portability of any sort would still be a huge bonus! To my knowledge there is no tool currently providing this capability. How cool if this were to be the first :)

m-majetic commented 2 years ago

Hi! I got interested in this feature as well and it really looks like there's no way to do it reliably yet. At least until providing the AudioBuffer is implemented into the browser spec.

The way I see it at first glance, it can kind of be implemented and the StackOverflow answers here seem promising. However, the answers and the code make it seem like you would first need to listen to all of the audio in order to save it via the MediaRecorder API.

I tried searching for a Chrome native API that exposes the AudioBuffer, in case there were any new changes that flew under the radar, but it looks like the Chrome API only accepts AudioBuffer input provided from the outside by another TTS engine, using the chrome.ttsEngine.onSpeakWithAudioStream event as seen here. So it's only take, but no give...

@ken107, @ericpelot Maybe we could band together and suggest that this be included in the spec, if it's worth it to you two. I know I'd love to have it, if for nothing else, then for Chrome. Its native TTS is magnificent in my opinion.