natlamir / PiperUI

A UI for the Piper TTS
42 stars 7 forks

Natlamir You're a Legend, but here's something that can make it even better #1

Open sarutobiumon opened 7 months ago

sarutobiumon commented 7 months ago

@natlamir Thanks for the awesome work! By the way, many of the languages do not work: when I copy and paste text from Google Translate in Arabic, Chinese, or Russian, for example, nothing happens.

I'm wondering whether it is on the roadmap to add a function to the UI that runs the steps below to clone a voice, add a new dataset for a particular person's cloned voice (Obama, for example), or even add a new language.

https://www.youtube.com/watch?v=b_we_jma220

Or you can reuse this guy's web UI, which he created specifically for that: https://ssamjh.nz/create-custom-piper-tts-voice/

I was also wondering if you could unify two of your WebUI functionalities, this PiperTTS one and a locally hosted Lip-Sync/Talking-Head one. That would truly be fantastic!

natlamir commented 7 months ago

The other languages may have failed because of the UTF encoding of the characters. I have a local fix for this from another open issue, which I will merge in about 1-2 days along with some other changes. In my local test, I was able to generate Russian audio using the voice dimitri.
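The actual fix is not shown in this thread. As a minimal sketch of how this kind of encoding problem is usually avoided, the Python snippet below sends the text to the Piper command-line tool as explicit UTF-8 bytes instead of relying on the console's default code page. The helper names and the example paths are hypothetical; `--model` and `--output_file` are the options used in Piper's own usage examples.

```python
import subprocess


def encode_for_piper(text: str) -> bytes:
    """Encode one utterance as UTF-8 bytes terminated by a newline."""
    return (text.strip() + "\n").encode("utf-8")


def build_piper_command(piper_exe: str, model_path: str, wav_path: str) -> list[str]:
    """Build the argument list for a Piper run that reads text from stdin."""
    return [piper_exe, "--model", model_path, "--output_file", wav_path]


def synthesize(text: str, piper_exe: str, model_path: str, wav_path: str) -> None:
    """Pipe UTF-8 text into Piper so non-Latin scripts survive the hand-off."""
    subprocess.run(
        build_piper_command(piper_exe, model_path, wav_path),
        input=encode_for_piper(text),  # bytes, so no implicit code-page conversion
        check=True,
    )
```

Assuming a Russian voice file is present, a call such as `synthesize("Привет, мир", "piper.exe", "voices/ru_RU-dmitri-medium.onnx", "out.wav")` would exercise this path; the file names here are placeholders.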

It would be nice to add voice cloning as described in the video and document, but according to them it requires Linux. I am in the process of using the Google Colab notebooks to clone the voice. The plan is to then drop the onnx of the cloned voice into a custom folder and select that custom voice from a new dropdown in the UI.

I plan on merging these changes in the next 1-2 days as well. Let me know if you find out anything about cloning the voice on Windows without requiring Linux or WSL.
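A rough sketch of the custom-folder idea, written in Python for illustration only (it is not PiperUI's code): scan a folder for `.onnx` voice models and keep those that have the `.onnx.json` config file Piper expects next to the model. The function name and folder layout are assumptions.

```python
from pathlib import Path


def list_custom_voices(custom_dir: str) -> list[str]:
    """Return the names of usable voices found in custom_dir, sorted by name.

    A voice counts as usable only if both `<name>.onnx` and
    `<name>.onnx.json` exist, since Piper needs the model and its config.
    """
    folder = Path(custom_dir)
    if not folder.is_dir():
        return []  # no custom folder yet, so the dropdown stays empty

    voices = []
    for model in sorted(folder.glob("*.onnx")):
        config = model.with_name(model.name + ".json")
        if config.is_file():
            voices.append(model.stem)
    return voices
```

The returned names could populate a dropdown directly, and the selected name maps back to `<custom_dir>/<name>.onnx` when synthesizing.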

PiperTTS + wav2lip would be interesting: add another input option for video, and have the generated audio be the audio input for wav2lip.
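As a sketch of that hand-off, assuming Piper has already written a WAV file and a local Wav2Lip checkout is available, the generated audio can be passed to Wav2Lip's `inference.py`. The helper names and all paths are hypothetical; the `--checkpoint_path`, `--face`, `--audio`, and `--outfile` options are the ones defined by Wav2Lip's inference script.

```python
import subprocess


def build_wav2lip_command(checkpoint: str, face_video: str, audio_wav: str,
                          outfile: str, python_exe: str = "python") -> list[str]:
    """Build the Wav2Lip inference command for one video/audio pair."""
    return [
        python_exe, "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face_video,   # the input video (or image) with the face
        "--audio", audio_wav,   # the WAV produced by Piper
        "--outfile", outfile,
    ]


def lip_sync(wav2lip_dir: str, checkpoint: str, face_video: str,
             audio_wav: str, outfile: str) -> None:
    """Run Wav2Lip from its checkout directory; raises if the run fails."""
    cmd = build_wav2lip_command(checkpoint, face_video, audio_wav, outfile)
    subprocess.run(cmd, cwd=wav2lip_dir, check=True)
```

Because the command runs with `cwd` set to the Wav2Lip checkout, relative paths are resolved against that directory, so absolute paths for the video, audio, and output are the safer choice.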

sarutobiumon commented 7 months ago

@natlamir Thanks for your reply! I saw that you got your original voice back via the new Google Colab Piper training script, haha. That's awesome, no more having to pay ElevenLabs!

I think VideoRetalking might be a better choice than wav2lip, given its higher-quality lip-syncing and clearer mouth movements compared to other projects. Speaking of which, I cannot find the VideoRetalking WebUI project under your name.

It would be awesome if the VideoRetalking WebUI and PiperTTS were combined into a single UI. You could pretty much make anyone say anything in a simulated, semi-real-time stream. What would be truly amazing is for the output video view to default to playing the input video, with the person not saying anything, in an endless loop while it waits for you to make it talk via a WAV generated by PiperTTS. Once the talking video finishes, the default input video takes over the output screen again, kind of like a live video interview.
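The idle-loop behavior described above can be sketched as a small clip scheduler. This is only an illustration of the control flow, with hypothetical class and file names; it does no video playback or lip-sync itself.

```python
from collections import deque


class AvatarPlayer:
    """Chooses which clip to play next: queued talking clips, else the idle loop."""

    def __init__(self, idle_clip: str) -> None:
        self.idle_clip = idle_clip
        self.pending: deque[str] = deque()

    def enqueue(self, talking_clip: str) -> None:
        """Register a finished lip-synced clip to be played once."""
        self.pending.append(talking_clip)

    def next_clip(self) -> str:
        """Call whenever the current clip ends to get the next one to play."""
        if self.pending:
            return self.pending.popleft()
        return self.idle_clip  # nothing to say, so keep looping the silent video
```

A player would call `next_clip()` each time playback ends, so the silent input video repeats until `enqueue()` is called with a newly generated talking clip, and it resumes once that clip has played.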

natlamir commented 7 months ago

I wonder if it would be possible to run the Piper training code on Windows. I might try that to see if I can get it to work. If it does work on Windows, being able to do it more conveniently from the UI would be a cool addition.

For VideoRetalking, in my tests a while back, it took a while to generate the output. If I remember correctly, a short 5-10 second clip took around 5 minutes to generate. I wonder if there is something like that for lip sync that is near real-time, even if it only works on an image. It would be cool to automate that text-to-speech-to-lip-sync workflow for near-real-time talking lip sync, even of a still image.

For the VideoRetalking UI, I ended up adding the UI to my fork of the original repository, which I think is called video-retalking.

sarutobiumon commented 7 months ago

@natlamir Thanks for your reply! It looks like Microsoft released, just today, exactly what I had in mind, but it is a paid service: https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-speech-announces-public-preview-of-text-to-speech/ba-p/3981448

@natlamir I took a closer look at this MS service. It looks like they are using an animated image in the center rather than a video, since that is a lot faster to process each time the avatar starts talking in sync with the TTS.