ken107 / read-aloud

An awesome browser extension that reads aloud webpage content with one click
https://readaloud.app
MIT License

Nvidia Riva support #321

Closed kfatehi closed 1 year ago

kfatehi commented 1 year ago

This PR adds support for Nvidia Riva. The Riva stack serves a gRPC service, which seemed non-trivial (impossible?) to interface with from the extension environment, so this implementation relies on a companion web service: it accepts an HTTP GET request, similar to the one used for IBM Watson, and calls Riva to return the desired OGG file.
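
For illustration, here is a minimal sketch of what the extension-side request to such a companion service could look like. The port, path, and query parameter names are hypothetical, not the proxy's actual API.

```typescript
// Hypothetical sketch: fetch synthesized speech from the companion proxy.
// The URL, path, and query parameter names are illustrative assumptions.
async function fetchRivaAudio(text: string, voice: string): Promise<Blob> {
  const url = new URL("http://localhost:8080/synthesize");
  url.searchParams.set("text", text);
  url.searchParams.set("voice", voice);
  const res = await fetch(url.toString());
  if (!res.ok) throw new Error(`Proxy returned ${res.status}`);
  return res.blob(); // OGG audio, playable via an HTMLAudioElement
}
```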

Audio Samples

riva-english-female-1.webm

riva-english-male-1.webm

Screenshots

(three screenshots attached)

kfatehi commented 1 year ago

Converted this to a draft because I forgot to implement the prosody features, which Riva supports; it's just a matter of piping them through the proxy.

But would this PR even be approved? I think it's important to support Riva, considering it's a fully offline yet very high-quality solution. One just needs the GPU power to run the models. By default, the Riva stack seems to consume around 13 GB of GPU memory, so I can only do this because I'm using an Nvidia 4090. Still pretty cool, though.

kfatehi commented 1 year ago

I would also want to reduce the blind code-copying I did from the IBM engine before this is merged, but I wanted to get it out there because it does work, and I want to see if there is interest in my completing it to a higher standard of quality. As it stands, it works for me.

ken107 commented 1 year ago

I'll merge it. This is indeed very cool. Even though most users right now won't have the hardware needed to run it locally, an offline next-gen TTS voice that's free to use is the future we'd like to get to. When the time comes, perhaps the TTS subsystem can be deployed as a native app, with the extension communicating with it via native messaging. (For reference: https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials-tts-contents.html)
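
As a rough illustration of the native-messaging idea (the host name, message shape, and `playOgg` helper below are all hypothetical; a native host would also need to be registered with the browser separately):

```typescript
// Hypothetical sketch of native messaging: the extension opens a port to a
// natively installed TTS host, which would perform the gRPC call to Riva
// locally. "com.example.riva_tts" and the message fields are assumptions.
declare function playOgg(audioBase64: string): void; // app-defined elsewhere (assumption)

const port = chrome.runtime.connectNative("com.example.riva_tts");

port.onMessage.addListener((msg: { audioBase64?: string; error?: string }) => {
  if (msg.error) console.error("TTS host error:", msg.error);
  else if (msg.audioBase64) playOgg(msg.audioBase64);
});

port.postMessage({ text: "Hello world", voice: "English-US-Female-1" });
```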

kfatehi commented 1 year ago

Ok great. I will clean things up, implement prosody control, and notify you when it's ready for review.

What do you mean by native messaging? I am not too thrilled about interposing a proxy between the extension and Riva, but I see I cannot use TCP directly via the Chrome extension API? https://developer.chrome.com/docs/extensions/reference/sockets_tcp/

Even if we were willing to add the complexity of gRPC into the extension, is it even allowed, or must everything be HTTP? You made me curious with your point about native messaging.

I think it'd be better to eliminate that proxy and just let people connect directly to the Riva stack, as that alone is hard enough to set up for a novice, which I'll summarize below:

The link you provided goes to their table of contents, which I basically followed to get the Riva stack running on my gaming computer. That machine runs Windows 11, Docker Desktop, and WSL2 with Ubuntu; WSL2 is needed to run the bash scripts that control the stack, which itself runs inside Docker with GPU support (a relatively new WSL2 feature).

ken107 commented 1 year ago

Well, at this time I think only a few users will be able to use this feature, so it's basically experimental. It doesn't have to be perfect, as long as it doesn't affect existing functionality. So you don't need to implement the prosody feature, or you can leave it for later; that's fine. Using a proxy is fine too; it'll be a work in progress.

Native messaging is a way for an extension to talk to a native app that's installed separately in the OS. The native app would still have to do the gRPC proxying, so I think using HTTP the way you're doing makes no difference. So never mind about native messaging.

As long as someone with the necessary hardware can set it up, this will be a fun experimental function they can try out.

kfatehi commented 1 year ago

@ken107 Ok great, it's ready for your review.

  1. I implemented prosody (pitch and rate interpreted within Riva), which meant passing a rate of 1 to the player and letting the proxy rescale the requested rate to Riva's preferred scale. Pitch is likewise rescaled to what Riva expects (see the sketch after this list). Works great.
  2. Implemented streaming in the proxy (text -> wav -> ogg) so latency is as low as possible (at the cost of some throughput, which is worth it since this is for personal use). This works perfectly, and it's super fast now regardless of sentence length.
  3. Implemented prefetching, which seems to work well and made the transition from paragraph to paragraph even faster.
  4. Fixed a bug where multi-sentence input was breaking Riva; the fix was as simple as tokenizing the input paragraph into sentences (also sketched below).
  5. Added MIT license
  6. Published to Docker hub
  7. Updated the README to reflect the new API (JSON -> OGG stream) and how to use it straight from the Docker hub.
  8. Got the Docker image down to 141 MB compressed (380 MB according to `docker images`) by rewriting the Dockerfile with Alpine Linux.
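
To make items 1 and 4 concrete, here is a minimal proxy-side sketch. The source ranges (Web-Speech-style rate and pitch), the target scale factor, and the sentence-splitting rules are assumptions for illustration; the actual proxy follows Riva's SSML prosody spec, which may differ.

```typescript
// Hypothetical sketch of the proxy-side prosody rescaling and sentence
// splitting. Ranges and scale factors below are illustrative assumptions.
function toRivaProsody(rate: number, pitch: number): { rate: string; pitch: string } {
  // Assume a Web-Speech-style rate (a multiplier around 1.0); express it
  // as an SSML percentage string.
  const ratePct = Math.round(rate * 100);
  // Assume a Web-Speech-style pitch in 0..2 centered on 1.0, mapped onto a
  // symmetric target range. The factor 3 is a placeholder, not Riva's spec.
  const rivaPitch = (pitch - 1) * 3;
  return { rate: `${ratePct}%`, pitch: rivaPitch.toFixed(1) };
}

// Naive splitter to avoid sending multi-sentence paragraphs to Riva in one
// request; the proxy's actual tokenization rules may be more involved.
function splitSentences(paragraph: string): string[] {
  return paragraph
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0);
}
```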

All in all, quite comprehensive! I'm more proud of the proxy now than before. A good Samaritan rich in GPU power could in theory deploy this, crank the WEB_CONCURRENCY environment variable way up, and serve many people. In the meantime, people with modern consumer GPUs can also use it.

Great job on read-aloud -- it was surprisingly easy to pull this off and it's still not even Sunday yet :)