More TTS!! - Githubissues

da3dsoul commented 1 year ago

Each TTS system has pros and cons. I'm going to build plugins for each of these to see how they perform, and since they'll be done, I may as well PR them.

https://github.com/coqui-ai/TTS - very good samples https://github.com/neonbjb/tortoise-tts - also very good samples https://github.com/CorentinJ/Real-Time-Voice-Cloning - custom voices? looks neat https://github.com/rhasspy/larynx - very low-spec compatible, acceptable quality https://github.com/TensorSpeech/TensorFlowTTS - very configurable from what I see

I care less about speed and more about quality, while some people might just want it to run with as little impact as possible.

Umm... we may need to rethink the UI layout for some things if all of these are actually accepted

St33lMouse commented 1 year ago

By the way, I'm running KDE ubuntu on a 3090 if you need that info.

Now trying to get mrq working. python server.py --chat --extensions tortoise_tts_mrq: ModuleNotFoundError: No module named 'tortoise.models.bigvgan'

Can't figure this one out. Need to know exactly what files it is looking for and where they should go.

And finally tortoise fast gets me: ImportError: cannot import name 'VocConf' from 'tortoise.models.vocoder' (/home/mouse/anaconda3/envs/soundgen/lib/python3.10/site-packages/TorToiSe-2.4.2-py3.10.egg/tortoise/models/vocoder.py)

da3dsoul commented 1 year ago

Dropping a new voice into the voices folder does not update the list of possible voices, I think.

That is WIP and almost done. I figured out the issue I had with UI updates, so I can do a bunch now.

Memory usage is heavy. I have pyg6B loaded, and tortoise pushes me close to the edge. Nothing you can do about that, I figure.

I'm going to add the code that Brawlence mentioned before that will allow swapping the models out. I've also added the low VRAM setting to MRQ.

These changes are all still local. The next update will be a fairly big one

St33lMouse commented 1 year ago

OK. remember, I can't get mrq or fast to work right now, only regular tortoise per my last report.

da3dsoul commented 1 year ago

MRQ and Fast have their own requirements.txt. did you pip install -r requirements.txt for each?

St33lMouse commented 1 year ago

Yes, same procedure for tortoise. Everything installed fine, no errors.

da3dsoul commented 1 year ago

/home/mouse/anaconda3/envs/soundgen/lib/python3.10/site-packages/TorToiSe-2.4.2-py3.10.egg/tortoise/models/vocoder.py

It appears to be looking in an egg, so I think the setup.py caused you issues like I said might happen. You can uninstall it as shown here https://stackoverflow.com/questions/1550226/python-setup-py-uninstall

St33lMouse commented 1 year ago

OK, nuked the conda environment and reinstalled, did not run setup.py. Installed Tortoise and tortoise mrq. Tortoise extension runs fine, but mrq complained about emma (the default tortoise voice), so I copied that voice folder to a couple of likely locations in the MRQ tree and the extension finally ran correctly.

The speed is still a couple of minutes at ultra fast (depending on the size of the sentence), and it shouldn't be--A quick sentence at ultra fast is about 15 seconds on my regular mrq installation. I note that the terminal shows it is computing conditional latents. I think mrq does that only once and saves the results in the voice folder so it doesn't have to compute every time.

Anyway, got it running, looking forward to your updates.

da3dsoul commented 1 year ago

Yep I've made decent progress. I'm testing at the moment. I forgot MRQ doesn't come with the voices. I can maybe make a script to check if they exist and pull them. That's a comparatively low priority, though.

This next update will have a lot of settings and UI for them, including custom directories for relevant things.

St33lMouse commented 1 year ago

No big deal on the pre-made tortoise voices for now, I agree.

The way MRQ works is you can just drop voices into the MRQ voice folder and it will do a very good imitation of them. it is a key feature--THE key feature and reason to use it. I doubt that MRQ is any faster that tortoise_fast, so this is the feature I'm hoping to see here.

Secondarily, MRQ allows voice training, and it produces a voice model. In MRQ you can point the gui at your trained model instead of autoregression.pth and use your voice clips to produce a high quality cloned voice. Training is beyond the scope of what you're doing, but being able to choose a model trained in the full MRQ setup would preserve what MRQ is all about.

da3dsoul commented 1 year ago

I see. I might change things up a bit to allow changing the directory from the UI. I made it obey the model dir argument, but I may want to choose a subdirectory structure to load into a dropdown

St33lMouse commented 1 year ago

Have you installed the full MRQ to see how it operates? He moved the folders around from the original tortoise setup, so watch out for that. But all you gotta do to make a new voice is take a few 10 second clips in wav format, put them in a folder in the voices folder, and start generating. Remember that part of the speed comes from an initial conditional latents calculation that he saves in the same folder as the wav files.

da3dsoul commented 1 year ago

I have not tried to use it past generating speech in the same way as the base tortoise. My goal was getting it running, letting people give feedback, and iterate. So far, according to plan lol. I noticed the code was reorganized a bit. I worked around that, too.

da3dsoul commented 1 year ago

Update is pushed. It's not perfect, and not all of the things mentioned are implemented yet. I'm trying to figure out the speed. It's very slow, and I'm not sure if it's my fault or with my MRQ setup. I haven't generated latents, for example

St33lMouse commented 1 year ago

It's probably calculating them every single generation instead of just once. So I just do a git pull, or something else?

da3dsoul commented 1 year ago

Yeah a git fetch --all and git pull should do it. If you get merge issues, you can use git stash

St33lMouse commented 1 year ago

OK, updated no problems. Speed seems pretty good at ultrafast. Probably should make sure ultrafast is your default choice, since what will happen is players will pick the example character which has a very long intro at standard generation, and then they'll be in for a ten minute sound generation treat :) They'll figure it's broken.

What's the purpose of the custom voice folder, and how might I use it?

By the way, I'm not a coder, I'm a game designer. My technical skills are no match for your own. I use, I don't make :)

da3dsoul commented 1 year ago

Custom directories are basically because I have a small boot SSD and a large bcached raid, so I point those to my mass storage. It also allows you to share a voices and models folder between the different builds

da3dsoul commented 1 year ago

When you say the speed is good...do you mean like 5 minutes or 20 seconds?

St33lMouse commented 1 year ago

45 seconds for a short generation at ultrafast. Double for standard.

The models are getting loaded and unloaded to save vram? I turned off the vram saving options, saw vram usage spike to an incredible 18gigs, and I only have the little opt-1.3 model loaded to prevent oom.

da3dsoul commented 1 year ago

Yeah MRQ seems to use a boatload of VRAM. I added all kinds of knobs to mess with, matching a lot of the options it has in the API. I needed to use the model unloading and low vram mode just to run the thing.

After building the latents, it dropped to about 1:45s for a test string of 11s. That actually matches more or less what you are getting with the difference of A4000 vs 3090. I can look at more stuff tomorrow. It is late

St33lMouse commented 1 year ago

Using the custom voice directory box works well. no need to refresh, it just looked in the new directory. Access to all my voices now.

OK, talk to you later.

da3dsoul commented 1 year ago

Ok, most of those options (the ones that are supported) from MRQ are now added to the other two. The fast branch is once again good, and it's fast enough to be interactive. With the model swapping, you can run it on a single reasonable GPU, as well! Tortoise does have its own issues with quality, but we may be able to work around some with preprocessing. In this example, there are some weird cadences and pauses:

Waffles and cakes are two different things, although they may share some similarities. Waffles are made with batter or dough that's poured into a grid-like iron and cooked until crispy on the outside and fluffy on the inside. Cakes, on the other hand, are baked in a rectangular pan and typically have a more moist and dense texture. While both waffles and cakes can be sweet, they are not the same thing.

https://user-images.githubusercontent.com/5205810/233737509-af7db745-7902-4363-bcde-c00c96a6af8b.mp4

Ph0rk0z commented 1 year ago

Did you get a look at bark? https://github.com/oobabooga/text-generation-webui/issues/1423

da3dsoul commented 1 year ago

I did not. I'll take a look. My next step is improving the oobx (out of box experience) for this, and then that seems like a decent thing to look at next.

wsippel commented 1 year ago

@Ph0rk0z There already is an extension for Bark - it's in the bug report you linked. I might try to upstream it once the dust settles, but given how new Bark is, it's probably better to maintain it as its own thing for the time being. Two of the three open pull requests change the API.

da3dsoul commented 1 year ago

@wsippel sweet! Feel free to use anything in here if it helps make it better

bubbabug commented 1 year ago

Hello de3dsoul, I am very interested in getting coqui_tts working, but I am having trouble. I have downloaded and copied over the extensions folder and pip installed the TTS requirement, but when I try to run the extension I get a failure to find tts_preprocessor file in modules. I attempted to copy over the preprocessor file from silero and I was able to startup, but it throws an error on generation. Where do I get the approrpiate tts_preprocessor file?

Edit: Nevermind, forgot to check your full repo.

da3dsoul commented 1 year ago

@bubbabug yeah, I had some basic instructions at https://github.com/oobabooga/text-generation-webui/issues/885#issuecomment-1509513725

wsippel commented 1 year ago

@da3dsoul Thank you, I'll take a look! Bark is quite unusual in that it does almost everything on it's own and exposes next to nothing. It's transformer-based, so it's fundamentally different from most other TTS systems. It generates speech the same way LLMs generate text, by constantly guessing the next phoneme (or sound, really) - it only uses the input text as guidance (which means it can go off the rails and start to ramble, ignoring most or all of the input text). It also generates ambient sounds. I asked it to tell my about Vicunas, and it did, but with jungle noises in the background. It also takes directions, like a movie script. Super interesting and powerful, but also very different from any other TTS engine I've ever worked with. Extremely fun to explore.

da3dsoul commented 1 year ago

Oh that's interesting. It's maybe not the best for a professional TTS system (where accuracy is king), but definitely neat for creative applications

wsippel commented 1 year ago

Yeah, it's very unpredictable, though it can be tamed to a degree. Maybe not really ideal for a digital assistant, let alone a screen reader, but the potential for stuff like an AI storyteller or roleplay is incredible. It 'acts'. Even at lower temperatures, it might do stuff like clearing its throat in the middle of a sentence, stutter, or inject dramatic pauses, 'um's, 'ah's or 'ya know's that aren't in the original text. You should give it a spin since you're interested in TTS - it's a trip.

St33lMouse commented 1 year ago

I'm not too sure that selecting a different voice is working in the MRQ version. I tried a male voice, but it still came out female. Might just be using emma for everything.

Opened my main MRQ install to have a look at default settings: cwp weight 1, diffusion temp: 1, length penalty 1, repetition penalty 2, conditoning free k: 2, temperature 0.8, diffusion model: diffusion_decoder.pth, vocoder:bigvgan_24khz_100 band

Using a custom finetune model (instead of autoregressive.pth) on the standard MRQ install with the voice files used to make the finetune produced noticeably recognizable results in 4 seconds at ultrafast, very acceptable. This was the text:

This is a test of the emergency broadcasting system. It is only another test.

vram started at 6 gigs, spiked only 1 more gig during that generation.

So the massive vram spikes you're experiencing isn't what I'm seeing in the regular install. At standard I watched it spike by 5 gigs, which isn't a good thing, but I get the same kind of spikes when doing text generations.

da3dsoul commented 1 year ago

@St33lMouse there's a bug that I addressed in https://github.com/oobabooga/text-generation-webui/pull/1484. Basically, settings aren't loaded before init, so the voices aren't properly loaded at init. You can refresh it by just adding a / to the voice dir and deleting it

St33lMouse commented 1 year ago

I don't understand. what voice dir? Put a slash where?

Right now I can load plain tortoise_tts with python server.py --chat --extensions tortoise_tts And I get a working tortoise interface with a list of all the tortoise voices. Selecting a new voice changes to the new voice correctly. I test first with a woman's voice, then a man's to be sure it has changed to a new voice. The custom voices box in the interface is empty.

Adding a folder path to the custom voices box will immediately add those voices to the list, and selecting them works. I don't think it is updating quality settings when I move from ultra fast to standard, because it still generates very quickly.

So basically, the plain tortoise_tts extension is working pretty good.

Today, I'm not having much success with the MRQ version. I'm not able to get it to work well enough to test switching voices. I am ooming with just a tiny opt 1.3B model loaded and 24 gigs of vram. When this happens, it passes the message back to my character, as though the character said it:

CUDA out of memory. Tried to allocate 16.02 GiB (GPU 0; 23.70 GiB total capacity; 18.06 GiB already allocated; 4.04 GiB free; 18.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

da3dsoul commented 1 year ago

Ok, sounds like you weren't affected by the bug I mentioned. I can test further with the preset and tuning settings, but they were working. How do you run it in a normal install? I pulled my method from read.py

St33lMouse commented 1 year ago

How do you run what?

Hang on, I just realized I didn't do a git pull, so I just did a git pull for your repo, so gotta test again.

da3dsoul commented 1 year ago

I mean when you run it without text-generation-webui, what command do you use?

St33lMouse commented 1 year ago

python server.py --chat --extensions tortoise_tts_mrq

After the git pull, tested again. MRQ is still ooming for me, so can't check switching voices. regular tortoise is fine, but I've seen it spike by 14 gigs vram. Still don't think quality settings in regular tortoise are doing anything.

tanfarou commented 1 year ago

@Ph0rk0z @St33lMouse @Brawlence I said it as an aside, but I'm going to tag you. I need testing/assistance on Tortoise. I either need help getting it to run alongside vicuna, which leaves 8GB of VRAM left to work with, and that seems reasonable, or just people to run it and gather data for it. It is here. If you just want to take my changes, you can grab modules/tts_preprocessor.py and the extensions/tortoise_tts* folders. Each one has an install_tortoise.sh script. cd into the extensions/tortoise_tts* folder and run sh install_tortoise.sh. The scripts will override the version bindings to not fuck your conda environment. That much is tested.

I am testing on a Windows PC running 3090. Would the install_tortoise.sh create a tortoise folder and then clone everything into directly, without adding "tortoise_tts_fast" again? (sorry, not familiar with shell). If I can get it running, can do some tests with vicuna running.

da3dsoul commented 1 year ago

It should run with git bash, but WSL is recommended for text-generation-webui on windows, as far as I've seen

da3dsoul commented 1 year ago

So the massive vram spikes you're experiencing isn't what I'm seeing in the regular install. At standard I watched it spike by 5 gigs, which isn't a good thing, but I get the same kind of spikes when doing text generations.

I thought you meant that running MRQ standalone didn't have the VRAM issues, but after rereading, you could have meant the non-mrq version

St33lMouse commented 1 year ago

I have never seen stand alone tortoise or mrq spike through the roof. They both seem to use about 6 gigs for inference. That doesn't mean they never to, but I haven't seen it. Yours versions--both tortoise and mrq-- I've seen spike by more than 14 gigs.

Ph0rk0z commented 1 year ago

I think they both have parameters that need to be passed for that. Otherwise they use all the vram they can.

da3dsoul commented 1 year ago

There is a "Low VRAM" mode for both, but that doesn't do that. It just doesn't precache. It only uses as much VRAM as it "needs", but that's still a lot. I've not tested standalone vs this. The preset definitely makes a difference in VRAM, and MRQ seems to use the most VRAM of all of them. I fixed the issue with the parameters in the UI throwing an error, too

da3dsoul commented 1 year ago

@tanfarou I'm not sure I understood what you meant. The install.sh just runs a few git commands, edits requirements.txt to not cause version conflicts, then runs pip install -r requirements.txt. The fancy git commands are for a sparse checkout, which only pulls what we need, rather than the whole repository. The goal was to avoid wasting space on things like the example output audio files

oobabooga commented 1 year ago

I'll write some comments for #1106 here:

coqui-tts sounds better than Silero, which is too unemotional. I liked it, but a con was that it installed a huge number of dependencies. It would also be nice to have more default voices to select from a dropdown menu, is that possible?
I couldn't test any of the tortoise implementations because all the install script did was create an empty folder with a .git in it as someone reported above. From past experience, it uses a huge amount of VRAM and is insanely slow. Given this and given the complicated installation procedure, I vote for not including it as a built-in extension into the repository.
For bark, there are these two extensions and I haven't tried any of them:

https://github.com/minemo/text-generation-webui-barktts https://github.com/wsippel/bark_tts

EDIT:

I have just tried the second extension above. Bark is quite impressive. It's interesting that it hallucinates sounds sometimes just like LLMs hallucinate text. Also, the audio quality seems to not match the one in the official examples.

Here is a sample:

https://vocaroo.com/1aPTlCYDakoH

OK, here we go. In 1943, Konrad Zuse built the first programmable computer called “Z-1”. It was used for military purposes during World War II. Then in 1950s, John Atanasoff invented an electronic digital computer named “Atanasoff–Berry Computer” (ABC). Later, in 1962, IBM introduced its first commercial computer system known as System/360. Nowadays, there are many different types of computers such as desktop PCs, laptops, tablets, smartphones etc.

da3dsoul commented 1 year ago

I couldn't test any of the tortoise implementations because all the install script did was create an empty folder with a .git in it as someone reported above. From past experience, it uses a huge amount of VRAM and is insanely slow. Given this and given the complicated installation procedure, I vote for not including it as a built-in extension into the repository.

Yeah it has high minimum requirements, for sure. The fast repo is actually fast, especially if you pre-compute the conditioning latents. I got it to unload the LLM, load tortoise, run it, and reload the LLM in under 30s for some responses. I didn't want to just clone the whole repo, but it might be necessary considering how inconsistent the sparse checkout is.

If it doesn't make it in as an included extension, we should probably have a wiki list with other extensions people have made. I didn't know about a lot of them.

Ph0rk0z commented 1 year ago

Yes, more extensions are better. Also want to try using a separate GPU for tortoise when I get my stuff in.

da3dsoul commented 1 year ago

It would also be nice to have more default voices to select from a dropdown menu, is that possible?

I'll look into that today if I find time. Tortoise got most of the UI love, as it was harder to work with and got more feedback.

da3dsoul commented 1 year ago

Ok, I added some descriptions to the model dropdown and made note of which ones need espeak-ng for Coqui. All of the English models are added. Due to how many models there are, a different kind of UI is necessary for all of the different languages.

oobabooga / text-generation-webui

More TTS!! #885