Closed da3dsoul closed 9 months ago
By the way, I'm running KDE ubuntu on a 3090 if you need that info.
Now trying to get mrq working. python server.py --chat --extensions tortoise_tts_mrq: ModuleNotFoundError: No module named 'tortoise.models.bigvgan'
Can't figure this one out. Need to know exactly what files it is looking for and where they should go.
And finally tortoise fast gets me: ImportError: cannot import name 'VocConf' from 'tortoise.models.vocoder' (/home/mouse/anaconda3/envs/soundgen/lib/python3.10/site-packages/TorToiSe-2.4.2-py3.10.egg/tortoise/models/vocoder.py)
Dropping a new voice into the voices folder does not update the list of possible voices, I think.
That is WIP and almost done. I figured out the issue I had with UI updates, so I can do a bunch now.
Memory usage is heavy. I have pyg6B loaded, and tortoise pushes me close to the edge. Nothing you can do about that, I figure.
I'm going to add the code that Brawlence mentioned before that will allow swapping the models out. I've also added the low VRAM setting to MRQ.
These changes are all still local. The next update will be a fairly big one.
OK. remember, I can't get mrq or fast to work right now, only regular tortoise per my last report.
MRQ and Fast have their own requirements.txt. Did you run `pip install -r requirements.txt` for each?
Yes, same procedure for tortoise. Everything installed fine, no errors.
/home/mouse/anaconda3/envs/soundgen/lib/python3.10/site-packages/TorToiSe-2.4.2-py3.10.egg/tortoise/models/vocoder.py
It appears to be looking in an egg, so I think the setup.py caused you issues like I said might happen. You can uninstall it as shown here https://stackoverflow.com/questions/1550226/python-setup-py-uninstall
OK, nuked the conda environment and reinstalled, did not run setup.py. Installed Tortoise and tortoise mrq. Tortoise extension runs fine, but mrq complained about emma (the default tortoise voice), so I copied that voice folder to a couple of likely locations in the MRQ tree and the extension finally ran correctly.
The speed is still a couple of minutes at ultra fast (depending on the length of the sentence), and it shouldn't be: a quick sentence at ultra fast takes about 15 seconds on my regular MRQ installation. I note that the terminal shows it is computing conditional latents. I think MRQ does that only once and saves the results in the voice folder so it doesn't have to recompute every time.
Anyway, got it running, looking forward to your updates.
Yep I've made decent progress. I'm testing at the moment. I forgot MRQ doesn't come with the voices. I can maybe make a script to check if they exist and pull them. That's a comparatively low priority, though.
This next update will have a lot of settings and UI for them, including custom directories for relevant things.
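A voice-availability check like the one mentioned could look something like this minimal sketch. The folder layout, default voice names, and function name are all assumptions for illustration, not the extension's actual code:

```python
import pathlib

# Hypothetical default voices; upstream tortoise ships several such folders.
DEFAULT_VOICES = ["emma", "tom"]

def missing_voices(voices_dir, wanted=DEFAULT_VOICES):
    """Return the wanted voice names that have no wav clips under voices_dir."""
    root = pathlib.Path(voices_dir)
    return [v for v in wanted if not any((root / v).glob("*.wav"))]
```

A real script could then clone or download only the missing folders rather than the whole repository.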
No big deal on the pre-made tortoise voices for now, I agree.
The way MRQ works is you can just drop voices into the MRQ voice folder and it will do a very good imitation of them. It is a key feature, THE key feature, and the reason to use it. I doubt that MRQ is any faster than tortoise_fast, so this is the feature I'm hoping to see here.
Secondarily, MRQ allows voice training, and it produces a voice model. In MRQ you can point the GUI at your trained model instead of autoregressive.pth and use your voice clips to produce a high-quality cloned voice. Training is beyond the scope of what you're doing, but being able to choose a model trained in the full MRQ setup would preserve what MRQ is all about.
I see. I might change things up a bit to allow changing the directory from the UI. I made it obey the model dir argument, but I may want to choose a subdirectory structure to load into a dropdown
Have you installed the full MRQ to see how it operates? He moved the folders around from the original tortoise setup, so watch out for that. But all you gotta do to make a new voice is take a few 10 second clips in wav format, put them in a folder in the voices folder, and start generating. Remember that part of the speed comes from an initial conditional latents calculation that he saves in the same folder as the wav files.
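The compute-once-and-cache pattern described above can be sketched roughly like this. The cache filename and the use of pickle are assumptions for illustration; MRQ's real code works with torch tensors and its own file format:

```python
import pathlib
import pickle

def compute_latents(wav_paths):
    # Stand-in for the real, expensive conditioning-latent computation.
    return {"n_clips": len(list(wav_paths))}

def get_latents(voice_dir):
    """Compute conditioning latents once and cache them beside the wav clips,
    so later generations skip the slow step entirely."""
    folder = pathlib.Path(voice_dir)
    cache = folder / "cond_latents.pth"  # filename is an assumption
    if cache.exists():
        return pickle.loads(cache.read_bytes())
    latents = compute_latents(sorted(folder.glob("*.wav")))
    cache.write_bytes(pickle.dumps(latents))
    return latents
```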
I have not tried to use it past generating speech in the same way as the base tortoise. My goal was getting it running, letting people give feedback, and iterate. So far, according to plan lol. I noticed the code was reorganized a bit. I worked around that, too.
Update is pushed. It's not perfect, and not all of the things mentioned are implemented yet. I'm trying to figure out the speed. It's very slow, and I'm not sure if it's my fault or an issue with my MRQ setup. I haven't generated latents, for example.
It's probably calculating them every single generation instead of just once. So I just do a git pull, or something else?
Yeah, `git fetch --all` and `git pull` should do it. If you get merge issues, you can use `git stash`.
OK, updated with no problems. Speed seems pretty good at ultrafast. You should probably make ultrafast the default choice, since what will happen is players will pick the example character, which has a very long intro, at standard generation, and then they'll be in for a ten-minute sound generation treat :) They'll figure it's broken.
What's the purpose of the custom voice folder, and how might I use it?
By the way, I'm not a coder, I'm a game designer. My technical skills are no match for your own. I use, I don't make :)
Custom directories are basically because I have a small boot SSD and a large bcached raid, so I point those to my mass storage. It also allows you to share a voices and models folder between the different builds
When you say the speed is good...do you mean like 5 minutes or 20 seconds?
45 seconds for a short generation at ultrafast. Double for standard.
The models are getting loaded and unloaded to save vram? I turned off the vram saving options, saw vram usage spike to an incredible 18gigs, and I only have the little opt-1.3 model loaded to prevent oom.
Yeah MRQ seems to use a boatload of VRAM. I added all kinds of knobs to mess with, matching a lot of the options it has in the API. I needed to use the model unloading and low vram mode just to run the thing.
After building the latents, it dropped to about 1:45s for a test string of 11s. That actually matches more or less what you are getting with the difference of A4000 vs 3090. I can look at more stuff tomorrow. It is late
Using the custom voice directory box works well. No need to refresh; it just looked in the new directory. I have access to all my voices now.
OK, talk to you later.
Ok, most of those options (the ones that are supported) from MRQ are now added to the other two. The fast branch is once again good, and it's fast enough to be interactive. With the model swapping, you can run it on a single reasonable GPU as well! Tortoise does have its own issues with quality, but we may be able to work around some of them with preprocessing. In this example, there are some weird cadences and pauses:
Waffles and cakes are two different things, although they may share some similarities. Waffles are made with batter or dough that's poured into a grid-like iron and cooked until crispy on the outside and fluffy on the inside. Cakes, on the other hand, are baked in a rectangular pan and typically have a more moist and dense texture. While both waffles and cakes can be sweet, they are not the same thing.
https://user-images.githubusercontent.com/5205810/233737509-af7db745-7902-4363-bcde-c00c96a6af8b.mp4
Did you get a look at bark? https://github.com/oobabooga/text-generation-webui/issues/1423
I did not. I'll take a look. My next step is improving the oobx (out of box experience) for this, and then that seems like a decent thing to look at next.
@Ph0rk0z There already is an extension for Bark - it's in the bug report you linked. I might try to upstream it once the dust settles, but given how new Bark is, it's probably better to maintain it as its own thing for the time being. Two of the three open pull requests change the API.
@wsippel sweet! Feel free to use anything in here if it helps make it better
Hello da3dsoul, I am very interested in getting coqui_tts working, but I am having trouble. I have downloaded and copied over the extensions folder and pip installed the TTS requirement, but when I try to run the extension I get a failure to find the tts_preprocessor file in modules. I attempted to copy over the preprocessor file from silero and I was able to start up, but it throws an error on generation. Where do I get the appropriate tts_preprocessor file?
Edit: Nevermind, forgot to check your full repo.
@bubbabug yeah, I had some basic instructions at https://github.com/oobabooga/text-generation-webui/issues/885#issuecomment-1509513725
@da3dsoul Thank you, I'll take a look! Bark is quite unusual in that it does almost everything on it's own and exposes next to nothing. It's transformer-based, so it's fundamentally different from most other TTS systems. It generates speech the same way LLMs generate text, by constantly guessing the next phoneme (or sound, really) - it only uses the input text as guidance (which means it can go off the rails and start to ramble, ignoring most or all of the input text). It also generates ambient sounds. I asked it to tell my about Vicunas, and it did, but with jungle noises in the background. It also takes directions, like a movie script. Super interesting and powerful, but also very different from any other TTS engine I've ever worked with. Extremely fun to explore.
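The "constantly guessing the next sound" behavior described below is just autoregressive sampling. Here is a toy sketch with a fake model; every name, token, and weight is made up purely for illustration:

```python
import random

def fake_next_token_dist(context):
    # Stand-in for Bark's transformer. Real Bark predicts semantic and
    # audio-codec tokens; here we just return (token, weight) choices.
    vocab = ["AH", "B", "R", "K", "<laugh>", "<eos>"]
    weights = [3, 2, 2, 2, 1, 1 + len(context) // 4]  # <eos> grows likelier
    return vocab, weights

def generate(prompt_tokens, max_len=32, seed=0):
    """Toy autoregressive loop: the prompt only conditions generation,
    it doesn't constrain it, which is why Bark can ramble off-script."""
    rng = random.Random(seed)
    out = list(prompt_tokens)
    for _ in range(max_len):
        vocab, weights = fake_next_token_dist(out)
        tok = rng.choices(vocab, weights=weights)[0]
        if tok == "<eos>":
            break
        out.append(tok)
    return out
```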
@da3dsoul Thank you, I'll take a look! Bark is quite unusual in that it does almost everything on its own and exposes next to nothing. It's transformer-based, so it's fundamentally different from most other TTS systems. It generates speech the same way LLMs generate text, by constantly guessing the next phoneme (or sound, really); it only uses the input text as guidance, which means it can go off the rails and start to ramble, ignoring most or all of the input text. It also generates ambient sounds. I asked it to tell me about Vicunas, and it did, but with jungle noises in the background. It also takes directions, like a movie script. Super interesting and powerful, but also very different from any other TTS engine I've ever worked with. Extremely fun to explore.
Oh that's interesting. It's maybe not the best for a professional TTS system (where accuracy is king), but definitely neat for creative applications
Yeah, it's very unpredictable, though it can be tamed to a degree. Maybe not really ideal for a digital assistant, let alone a screen reader, but the potential for stuff like an AI storyteller or roleplay is incredible. It 'acts'. Even at lower temperatures, it might do stuff like clearing its throat in the middle of a sentence, stutter, or inject dramatic pauses, 'um's, 'ah's or 'ya know's that aren't in the original text. You should give it a spin since you're interested in TTS - it's a trip.
I'm not too sure that selecting a different voice is working in the MRQ version. I tried a male voice, but it still came out female. Might just be using emma for everything.
Opened my main MRQ install to have a look at the default settings: cvvp weight 1, diffusion temp 1, length penalty 1, repetition penalty 2, conditioning-free k 2, temperature 0.8, diffusion model diffusion_decoder.pth, vocoder bigvgan_24khz_100band
Using a custom finetune model (instead of autoregressive.pth) on the standard MRQ install with the voice files used to make the finetune produced noticeably recognizable results in 4 seconds at ultrafast, very acceptable. This was the text:
This is a test of the emergency broadcasting system. It is only another test.
vram started at 6 gigs, spiked only 1 more gig during that generation.
So the massive vram spikes you're experiencing aren't what I'm seeing in the regular install. At standard I watched it spike by 5 gigs, which isn't a good thing, but I get the same kind of spikes when doing text generations.
@St33lMouse there's a bug that I addressed in https://github.com/oobabooga/text-generation-webui/pull/1484. Basically, settings aren't loaded before init, so the voices aren't properly loaded at init. You can refresh it by just adding a / to the voice dir and deleting it
I don't understand. What voice dir? Put a slash where?
Right now I can load plain tortoise_tts with python server.py --chat --extensions tortoise_tts And I get a working tortoise interface with a list of all the tortoise voices. Selecting a new voice changes to the new voice correctly. I test first with a woman's voice, then a man's to be sure it has changed to a new voice. The custom voices box in the interface is empty.
Adding a folder path to the custom voices box will immediately add those voices to the list, and selecting them works. I don't think it is updating quality settings when I move from ultra fast to standard, because it still generates very quickly.
So basically, the plain tortoise_tts extension is working pretty good.
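The directory scan behind a custom-voices box could be as simple as this hypothetical sketch, where a voice is any subfolder containing at least one wav. The names are illustrative, not the extension's actual code:

```python
import pathlib

def list_voices(*voice_dirs):
    """Collect voice names from one or more directories: any subfolder with
    at least one wav counts. Re-running this whenever the custom directory
    box changes repopulates the dropdown without a refresh button."""
    names = set()
    for d in voice_dirs:
        root = pathlib.Path(d)
        if not root.is_dir():
            continue  # silently skip missing or mistyped paths
        for sub in root.iterdir():
            if sub.is_dir() and any(sub.glob("*.wav")):
                names.add(sub.name)
    return sorted(names)
```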
Today, I'm not having much success with the MRQ version. I'm not able to get it to work well enough to test switching voices. I am ooming with just a tiny opt 1.3B model loaded and 24 gigs of vram. When this happens, it passes the message back to my character, as though the character said it:
CUDA out of memory. Tried to allocate 16.02 GiB (GPU 0; 23.70 GiB total capacity; 18.06 GiB already allocated; 4.04 GiB free; 18.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
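For what it's worth, the allocator hint this error message mentions is set through an environment variable before torch initializes CUDA. The value below is only an example, not a recommendation:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before torch touches CUDA.
# 128 MB is an arbitrary example value to reduce fragmentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

# import torch  # torch would pick up the setting from here on
```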
Ok, sounds like you weren't affected by the bug I mentioned. I can test further with the preset and tuning settings, but they were working. How do you run it in a normal install? I pulled my method from read.py
How do you run what?
Hang on, I just realized I hadn't done a git pull. I just did one for your repo, so I've gotta test again.
I mean when you run it without text-generation-webui, what command do you use?
python server.py --chat --extensions tortoise_tts_mrq
After the git pull, tested again. MRQ is still ooming for me, so can't check switching voices. regular tortoise is fine, but I've seen it spike by 14 gigs vram. Still don't think quality settings in regular tortoise are doing anything.
@Ph0rk0z @St33lMouse @Brawlence I said it as an aside, but I'm going to tag you. I need testing/assistance on Tortoise. I either need help getting it to run alongside vicuna, which leaves 8GB of VRAM left to work with (and that seems reasonable), or just people to run it and gather data for it. It is here. If you just want to take my changes, you can grab `modules/tts_preprocessor.py` and the `extensions/tortoise_tts*` folders. Each one has an `install_tortoise.sh` script. cd into the `extensions/tortoise_tts*` folder and run `sh install_tortoise.sh`. The scripts will override the version bindings to not fuck your conda environment. That much is tested.
I am testing on a Windows PC with a 3090. Would install_tortoise.sh create a tortoise folder and then clone everything directly into it, without adding "tortoise_tts_fast" again? (Sorry, not familiar with shell.) If I can get it running, I can do some tests with vicuna running.
It should run with git bash, but WSL is recommended for text-generation-webui on windows, as far as I've seen
> So the massive vram spikes you're experiencing isn't what I'm seeing in the regular install. At standard I watched it spike by 5 gigs, which isn't a good thing, but I get the same kind of spikes when doing text generations.
I thought you meant that running MRQ standalone didn't have the VRAM issues, but after rereading, you could have meant the non-mrq version
I have never seen standalone tortoise or mrq spike through the roof. They both seem to use about 6 gigs for inference. That doesn't mean they never do, but I haven't seen it. Your versions, both tortoise and mrq, I've seen spike by more than 14 gigs.
I think they both have parameters that need to be passed for that. Otherwise they use all the vram they can.
There is a "Low VRAM" mode for both, but that doesn't do that. It just doesn't precache. It only uses as much VRAM as it "needs", but that's still a lot. I've not tested standalone vs this. The preset definitely makes a difference in VRAM, and MRQ seems to use the most VRAM of all of them. I fixed the issue with the parameters in the UI throwing an error, too
@tanfarou I'm not sure I understood what you meant. The install.sh just runs a few git commands, edits requirements.txt to not cause version conflicts, then runs pip install -r requirements.txt. The fancy git commands are for a sparse checkout, which only pulls what we need, rather than the whole repository. The goal was to avoid wasting space on things like the example output audio files
I'll write some comments for #1106 here:
https://github.com/minemo/text-generation-webui-barktts
https://github.com/wsippel/bark_tts
EDIT:
I have just tried the second extension above. Bark is quite impressive. It's interesting that it hallucinates sounds sometimes, just like LLMs hallucinate text. Also, the audio quality doesn't seem to match the official examples.
Here is a sample:
https://vocaroo.com/1aPTlCYDakoH
OK, here we go. In 1943, Konrad Zuse built the first programmable computer called “Z-1”. It was used for military purposes during World War II. Then in 1950s, John Atanasoff invented an electronic digital computer named “Atanasoff–Berry Computer” (ABC). Later, in 1962, IBM introduced its first commercial computer system known as System/360. Nowadays, there are many different types of computers such as desktop PCs, laptops, tablets, smartphones etc.
I couldn't test any of the tortoise implementations because all the install script did was create an empty folder with a .git in it as someone reported above. From past experience, it uses a huge amount of VRAM and is insanely slow. Given this and given the complicated installation procedure, I vote for not including it as a built-in extension into the repository.
Yeah it has high minimum requirements, for sure. The fast repo is actually fast, especially if you pre-compute the conditioning latents. I got it to unload the LLM, load tortoise, run it, and reload the LLM in under 30s for some responses. I didn't want to just clone the whole repo, but it might be necessary considering how inconsistent the sparse checkout is.
If it doesn't make it in as an included extension, we should probably have a wiki list with other extensions people have made. I didn't know about a lot of them.
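The unload-LLM / load-TTS / reload-LLM sequence described above boils down to a swap pattern like this hypothetical sketch. The load/unload stand-ins here just record events; real code would move model weights between GPU and CPU:

```python
from contextlib import contextmanager

events = []  # recorded for illustration

def unload(model):
    # Stand-in: would move weights off the GPU and free VRAM.
    events.append(f"unload:{model}")

def load(model):
    # Stand-in: would load weights onto the GPU.
    events.append(f"load:{model}")

@contextmanager
def swapped_in(tts, llm):
    """Temporarily give the whole GPU to the TTS model, then restore the
    LLM even if generation fails. Names are assumptions, not actual code."""
    unload(llm)
    load(tts)
    try:
        yield
    finally:
        unload(tts)
        load(llm)

with swapped_in("tortoise", "vicuna"):
    events.append("generate-speech")
```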
Yes, more extensions are better. Also want to try using a separate GPU for tortoise when I get my stuff in.
It would also be nice to have more default voices to select from a dropdown menu. Is that possible?
I'll look into that today if I find time. Tortoise got most of the UI love, as it was harder to work with and got more feedback.
Ok, I added some descriptions to the model dropdown and made note of which ones need espeak-ng for Coqui. All of the English models are added. Due to how many models there are, a different kind of UI is necessary for all of the different languages.
Each TTS system has pros and cons. I'm going to build plugins for each of these to see how they perform, and since they'll be done, I may as well PR them.
- https://github.com/coqui-ai/TTS - very good samples
- https://github.com/neonbjb/tortoise-tts - also very good samples
- https://github.com/CorentinJ/Real-Time-Voice-Cloning - custom voices? looks neat
- https://github.com/rhasspy/larynx - very low-spec compatible, acceptable quality
- https://github.com/TensorSpeech/TensorFlowTTS - very configurable from what I see
I care less about speed and more about quality, while some people might just want it to run with as little impact as possible.
Umm... we may need to rethink the UI layout for some things if all of these are actually accepted