oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

More TTS!! #885

Closed da3dsoul closed 9 months ago

da3dsoul commented 1 year ago

Each TTS system has pros and cons. I'm going to build plugins for each of these to see how they perform, and since they'll be done anyway, I may as well PR them.

- https://github.com/coqui-ai/TTS - very good samples
- https://github.com/neonbjb/tortoise-tts - also very good samples
- https://github.com/CorentinJ/Real-Time-Voice-Cloning - custom voices? looks neat
- https://github.com/rhasspy/larynx - very low-spec compatible, acceptable quality
- https://github.com/TensorSpeech/TensorFlowTTS - very configurable from what I see

I care less about speed and more about quality, while some people might just want it to run with as little impact as possible.

Umm... we may need to rethink the UI layout for some things if all of these are actually accepted

IllogicalDesigns commented 1 year ago

I really like Tortoise for the quality it produces, but it is slow. Have you tried the tortoise-tts-fast repo? It is no longer in development and I had a hell of a time setting it up, but it is much faster. https://github.com/152334H/tortoise-tts-fast

da3dsoul commented 1 year ago

I'll check it out, thanks

Ph0rk0z commented 1 year ago

https://git.ecker.tech/mrq/ai-voice-cloning

da3dsoul commented 1 year ago

Ok, Coqui AI TTS is working, at least. More configurability is WIP. It is about as fast as Silero using CUDA on my A4000 (comparable to an RTX 3070, but with 16GB of VRAM). It is slow as balls on CPU (3600X, but I doubt it matters much when it takes 5+ minutes on a recent enough mid-range CPU).

Compared against Silero

Pros:

Cons:

da3dsoul commented 1 year ago

@IllogicalDesigns I've yet to test, but from reading around, it appears that the difference between the main branch and the fast fork is narrowing. Fast is still faster, but only by about 10%, rather than 5-10x like it was. We may need to make a decision between it being fast now and it being maintained. I'll try both and report back, of course

St33lMouse commented 1 year ago

So this all runs locally on your machine? Very interested and will be following keenly. I was thinking of making an extension to use tortoise for voices.

The problem with Silero is quality and you're stuck with the voices they have. The problem with elevenlabs is that everything resides on their machines and they can see what you've been chatting about. Don't like that :) And you have to pay.

da3dsoul commented 1 year ago

Yes! That was my intention. I prefer local systems and customization. Coqui is in a PR. Tortoise is next

St33lMouse commented 1 year ago

I haven't tried Coqui yet, but I have worked with the MRQ repository. I'm not sure how that differs from the tortoise fast version, but you can certainly do voice training with MRQ. The problem is the speed (of course), and vram usage which will require downgrading your language model if you want everything to fit.

Tortoise or the MRQ version (not sure which) has issues with long sentences. MRQ attacked that problem with a newline character to signal breaking off and starting a new generation. That worked, but you tended to get distortion near the ends of lines. Tortoise may be slow and buggy, but it IS very controllable.

Anyway, good luck on this. I'm dying to try it!

St33lMouse commented 1 year ago

Oh, a small but useful feature would be assigning a specific voice to a character automatically.

Brawlence commented 1 year ago

You'd probably want to make this into a comparison table, like this:

| Header1 | Model 1 | Model 2 |
|---------|---------|---------|
| Type | blabla | haha |
| Pro | Much | Also many |
| Con | not many | not much |

It's made like this:

Header1 | Model 1 | Model 2
-|-|-
Type| blabla | haha
Pro| Much | Also many
Con| not many | not much

Also for pros, don't forget to include zero-shot voice cloning. It's very important: with Tortoise-TTS I can simply put in 4 voice files of 40 seconds each to reliably compute voice features and generate a (never-before-seen in training) voice.

da3dsoul commented 1 year ago

Can confirm.... Tortoise is slow. Got it running in a test. I'll probably try the "fast" version next. It's possible that the first voice I tried happened to be slower than normal. This is the first test, after all.

EDIT: That finally finished. Only took like 10 minutes.... Soooo, "train_dreams" at the least is very slow and not even as natural sounding as Coqui's example voices, and Coqui is fast enough to use in a chat. This will take more experimenting.

EDIT2: emma (default voice) isn't any faster, maybe even slower tbh. Even if this was 10x faster, I still wouldn't use it. As I said, I'll still PR it in case someone else really wants to use it. I'll also still make an install script and extension for the Fast fork.

St33lMouse commented 1 year ago

are you using the MRQ repo? It IS slow. But if you put maybe four 15 second clips in the voice folder and use ultra fast to generate, it should take about 15 seconds. At least it does for me on a 3090. The first generation will be extra slow since it has to do some preparatory work on your voice clips. The advantage of the MRQ repo is the voice fine tuning.
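The "preparatory work" here is deriving per-voice conditioning data from the clips, which can be cached so only the first generation for a voice pays the cost. A toy sketch of that caching pattern (all names hypothetical, not from the MRQ code):

```python
# Hypothetical cache of expensive per-voice data; real implementations
# would key on the clip files' contents, not just the voice name.
_latent_cache = {}

def get_conditioning(voice_name, compute_fn):
    """Run the expensive computation once per voice and reuse it afterwards.

    compute_fn stands in for whatever call derives features from the
    voice clips; only the first generation for a voice pays its cost.
    """
    if voice_name not in _latent_cache:
        _latent_cache[voice_name] = compute_fn(voice_name)
    return _latent_cache[voice_name]
```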

I haven't tried the tortoise-tts fast repo, may get better results there.

While we wait for the pull request, is it possible to install your extension and try it out? Drop the files in extensions, maybe?

da3dsoul commented 1 year ago

I don't have a 3090. I have an A4000. I mentioned in a previous comment how that compares. I am not using custom voice clips, just the built-in voices. Is ultrafast the preset? I am using standard for preset. I am not using MRQ's fork, though I could potentially. I suppose I can make a branch in my fork and upload what I have. It's still in the testing phase, but so is all of AI, I guess.

St33lMouse commented 1 year ago

I've only ever tried the mrq repo. It has a gui with the ultrafast setting. I could look at the preset numbers for you tomorrow--not at my strong computer today.

Silero for me is very fast. I imagine Coqui will be as well. I'm eager to check out Coqui and see if I can clone voices with it easily enough. If I can, it's golden for me.

da3dsoul commented 1 year ago

Based on what I saw, you can. I don't know how good the quality is, but the process seemed simple enough. Here's the branch with the tortoise extension. Run install_tortoise.sh with the current directory set to ./extensions/tortoise_tts, then it should just work. https://github.com/da3dsoul/text-generation-webui

Ph0rk0z commented 1 year ago

Coqui is not good at copying voices like mrq but it is a good sounding TTS. MRQ now uses stock tortoise AFAIK.

Ideally I would like to run TTS/SD on one GPU and the LLM on another.

da3dsoul commented 1 year ago

Ideally I would like to run TTS/SD on one GPU and the LLM on another.

That can be done. Even with the current setup, you can choose which GPU both run on.
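One way to do the split, sketched here (this is not how the extension currently exposes it; `pin_process_gpus` is a hypothetical helper), is to pin each process to its own GPU via `CUDA_VISIBLE_DEVICES`:

```python
import os

def pin_process_gpus(gpu_ids):
    """Restrict this process to specific GPUs.

    Hypothetical helper for illustration: CUDA_VISIBLE_DEVICES must be
    set before torch (or any CUDA library) is first imported, or it has
    no effect. gpu_ids is a list of integer GPU indices, e.g. [1] to
    use only the second GPU.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]
```

So you could, for example, launch the LLM process pinned to GPU 0 and the TTS/SD process pinned to GPU 1.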

@oobabooga can you remove the Spanish nonsense (not because it's Spanish, but because it's nonsense)

Ph0rk0z commented 1 year ago

Some AI wanted to make a bug report too xD

da3dsoul commented 1 year ago

The fork has a tortoise_tts_fast extension now. It includes a bash script to install it successfully. I don't have the VRAM to test in UI mode (not using my test scripts), so if one of you would like to, by all means. With the ultra_fast preset, it's fast enough to be considered interactive again. It just takes a lot of VRAM, or I have it set up wrong....both are reasonable.

EDIT: MRQ is there now, too

P.S. this issue got buried fast....

da3dsoul commented 1 year ago

I'm dumping some links here, mostly for my own usage. Others may find them useful. They are all related to text normalization, which is the process of expanding text into more proper language or phonemes for TTS. The one from XuanyiZ looks especially neat for something like Neuro-sama reading Twitch chat.

- https://github.com/google-research-datasets/TextNormalizationCoveringGrammars
- https://github.com/shauryr/google_text_normalization
- https://github.com/XuanyiZ/Text-Normalization
- https://github.com/csebuetnlp/normalizer
- https://github.com/tomaarsen/TTSTextNormalization

This might be useful, but not by itself, most likely: https://github.com/cbaziotis/ekphrasis
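For a flavor of what these projects do, here is a toy normalizer (purely illustrative; the linked repos handle dates, currencies, ordinals, and far more, and this naive digit-by-digit expansion is only one possible policy):

```python
import re

# Hypothetical minimal tables; real normalizers use covering grammars.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def _spell_digits(match):
    # read a digit run aloud one digit at a time, e.g. "42" -> "four two"
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text):
    """Expand abbreviations and digits so the TTS doesn't have to guess."""
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(re.escape(abbr), full, text, flags=re.IGNORECASE)
    return re.sub(r"\d+", _spell_digits, text)
```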

da3dsoul commented 1 year ago

I merged this into the Coqui PR, because...it's kind of dependent on some core changes (the tts_preprocessor). Now it has Coqui and the Tortoise implementations that have been discussed.

EDIT: From what I can tell, Larynx and TensorFlowTTS only really have a benefit in their UI and maybe preprocessing, as the voice models are all overlapping with Coqui AI. I am going to skip those unless someone has a reason for me to pursue them.

da3dsoul commented 1 year ago

@Ph0rk0z @St33lMouse @Brawlence I said it as an aside, but I'm going to tag you. I need testing/assistance on Tortoise. I either need help getting it to run alongside vicuna, which leaves 8GB of VRAM left to work with, and that seems reasonable, or just people to run it and gather data for it. It is here. If you just want to take my changes, you can grab modules/tts_preprocessor.py and the extensions/tortoise_tts* folders. Each one has an install_tortoise.sh script. cd into the extensions/tortoise_tts* folder and run sh install_tortoise.sh. The scripts will override the version bindings to not fuck your conda environment. That much is tested.

BarfingLemurs commented 1 year ago

@da3dsoul hi, request for recommendations/hints, info on the best backend for a user's setup within the interface

da3dsoul commented 1 year ago

@da3dsoul hi, request for recommendations/hints, info on the best backend for a user's setup within the interface

I'm not sure what you mean. If you are asking which TTS is the best, then right now I'd say Coqui, but they each have their own pros and cons. Silero is the easiest to run, as it's just as fast on CPU, leaving the GPU free for LLM inferencing. Coqui is higher quality and has a lot of options, but requires some GPU to run. The minimum specs for it depend on settings. The three versions of Tortoise are included because they each have advantages. The original repo presumably will have the most support. The MRQ repo has a lot of extra features. The fast repo is geared towards inferencing as fast as possible; it's no longer maintained, as of a few weeks ago. We'll see if someone picks up the mantle.

Ph0rk0z commented 1 year ago

My tortoise and MRQ installs all have their own venvs. This forces you to use the same venv.

If I want to use an AMD or some other GPU for TTS it will not work because the env will be the one from nvidia. Same if mrq or tortoise ever have conflicting requirements for packages.

It would also be nice to have voices automatically read from the voices location instead of being hard-coded. If you made other voices in tortoise, they won't show up.

da3dsoul commented 1 year ago

All my tortoise and MRQ have different venvs. This forces you to use the same VENV.

If I want to use an AMD or some other GPU for TTS it will not work because the env will be the one from nvidia. Same if mrq or tortoise ever have conflicting requirements for packages.

Well, the only alternative is calling it via a WebAPI. I don't know if one is built-in. All I saw was a webui. My goal here was to make the usage as simple as possible, like Silero is.

It would also be nice to have voices automatically read from the voices location instead of being hard-coded. If you made other voices in tortoise, they won't show up.

A configurable voices path is easy enough. I can add that
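A minimal sketch of what that could look like (assuming Tortoise's layout of one subfolder of .wav clips per voice; `list_voices` is a hypothetical name, not the extension's actual code):

```python
import os

def list_voices(voices_dir):
    """Scan the voices directory instead of hard-coding the voice list.

    A voice is any subfolder containing at least one .wav clip;
    empty or unrelated folders are ignored.
    """
    if not os.path.isdir(voices_dir):
        return []
    voices = []
    for name in sorted(os.listdir(voices_dir)):
        folder = os.path.join(voices_dir, name)
        if os.path.isdir(folder) and any(f.endswith(".wav") for f in os.listdir(folder)):
            voices.append(name)
    return voices
```

Calling this each time the dropdown is built would also pick up voices dropped in mid-session.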

Ph0rk0z commented 1 year ago

It is indeed working.. but I think tortoise is way heavy for this type of thing on one GPU. Still, it's fucking funny to have Joe Biden responding, even if it does take 60 sec to make it sound good.

I'll have to see how it juggles loading and unloading a bigger model. I've been doing the 7b so far.

So for performance here is some info:

Using ultra_fast preset on normal tortoise.. (perhaps it needs some other TTS params besides the preset; it uses 18GB VRAM)

alpaca-native-7b
Output generated in 67.54 seconds (0.37 tokens/s, 25 tokens, context 638, seed 337508091)
Output generated in 46.70 seconds (0.60 tokens/s, 28 tokens, context 581, seed 50929688)

gptx-alpaca-13b
Output generated in 26.62 seconds (0.83 tokens/s, 22 tokens, context 695, seed 1928108461)
But sometimes it generates more than one audio for some reason:

Generating autoregressive samples..
100%|████████████████████████████████████████████| 1/1 [00:02<00:00,  2.95s/it]
Computing best candidates using CLVP
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 10.01it/s]
Transforming autoregressive outputs into audio..
100%|██████████████████████████████████████████| 30/30 [00:00<00:00, 52.05it/s]
Generating autoregressive samples..
100%|████████████████████████████████████████████| 1/1 [00:43<00:00, 43.04s/it]
Computing best candidates using CLVP
100%|████████████████████████████████████████████| 1/1 [00:00<00:00,  7.75it/s]
Transforming autoregressive outputs into audio..
100%|██████████████████████████████████████████| 30/30 [00:01<00:00, 20.85it/s]
Generating autoregressive samples..
100%|████████████████████████████████████████████| 1/1 [00:52<00:00, 52.33s/it]
Computing best candidates using CLVP
100%|████████████████████████████████████████████| 1/1 [00:00<00:00,  7.18it/s]
Transforming autoregressive outputs into audio..
100%|██████████████████████████████████████████| 30/30 [00:01<00:00, 16.83it/s]
Generating autoregressive samples..
100%|████████████████████████████████████████████| 1/1 [00:38<00:00, 38.42s/it]
Computing best candidates using CLVP
100%|████████████████████████████████████████████| 1/1 [00:00<00:00,  8.05it/s]
Transforming autoregressive outputs into audio..
100%|██████████████████████████████████████████| 30/30 [00:01<00:00, 17.93it/s]
Generating autoregressive samples..
100%|████████████████████████████████████████████| 1/1 [00:01<00:00,  1.15s/it]
Computing best candidates using CLVP
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 10.13it/s]
Transforming autoregressive outputs into audio..
100%|██████████████████████████████████████████| 30/30 [00:00<00:00, 52.20it/s]
Output generated in 174.08 seconds (0.38 tokens/s, 67 tokens, context 749, seed 1770987472)

Edit: I think it does it 1 sentence at a time.
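That matches the log above: each "Generating autoregressive samples.." pass is one chunk. A naive sketch of that kind of per-sentence splitting (hypothetical; the extension's real preprocessing may differ):

```python
import re

def split_for_tts(text):
    """Split a reply into sentences so each TTS call gets one sentence.

    Naive boundary split on ., !, ? followed by whitespace; a real
    preprocessor would also handle abbreviations, quotes, and
    overly long sentences.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```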

da3dsoul commented 1 year ago

Updated to pull the model dir from the args and added settings for the voices dir and sentence length. I'm curious how adjusting the length will affect it. My GPU doesn't even have 18GB of VRAM lol

Note to self: use this and add more settings. EDIT: the above is done

Ph0rk0z commented 1 year ago

I know on MRQ there is some option that makes it use lower VRAM and other stuff. When I'm doing regular voice cloning it only uses 6-8GB.

Coqui online does jankier cloning.. I wonder if the local copy does too. Trying it out.

per doc you can pass it: https://tts.readthedocs.io/en/latest/index.html?highlight=voices#python-api

tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_with_vc_to_file(
    "Wie sage ich auf Italienisch, dass ich dich liebe?",
    speaker_wav="target/speaker.wav",
    file_path="output.wav"
)

da3dsoul commented 1 year ago

Ok, Coqui VC path setting added, along with a bunch of other settings and some fixes

Ph0rk0z commented 1 year ago

It does indeed work.. I set the path manually; was the path supposed to go into the speaker box or another one?

da3dsoul commented 1 year ago

No, it doesn't have UI yet. A lot of the things I just added need UI. That's the next thing I'll do, probably tomorrow (9PM ATM)

EDIT: gradio UI is neat, but hard to make reactive....

St33lMouse commented 1 year ago

OK, time to get this thing installed. First problem: running the tortoise install scored me a tortoise folder containing nothing but a hidden .git folder. No requirements.txt, nothing.

I'm pretty dumb about this stuff, which makes me a useful tester, I think. Why shouldn't the user just git clone https://github.com/neonbjb/tortoise-tts ?

March me through this like the perfect idiot I am. I'm sure I'll stumble upon every single unexpected barrier to installation and running the thing. But I want to see this work, so I'll stick with it.

Brawlence commented 1 year ago

@da3dsoul

I said it as an aside, but I'm going to tag you. I need testing/assistance on Tortoise. I either need help getting it to run alongside vicuna, which leaves 8GB of VRAM left to work with, and that seems reasonable

I implemented temporary model unloading in sd_api_pictures specifically to combat this. An additional 4-10 seconds to shuffle models between VRAM and RAM pales in comparison to Tortoise-TTS's innate 60+ second generation time on fine presets.

The up-front cost is pretty high, though. For this to work one needs to implement an API to unload and reload models in Tortoise-TTS (here's my pull for A1111 for reference). Then, in the extension scripts:

from modules.models import reload_model, unload_model

implement something akin to a give_VRAM_priority() function:

def give_VRAM_priority(actor):
    global shared, params

    if actor == 'TTS':
        unload_model()
        print("Requesting Tortoise to re-load voice generation model...")
        # placeholder URL: what's the API to reload the model in Tortoise?
        response = requests.post(url=f'{tortoise_reload_endpoint}', json='')
        response.raise_for_status()

    elif actor == 'LLM':
        print("Requesting Tortoise to vacate VRAM...")
        # placeholder URL: what's the API to unload the model in Tortoise?
        response = requests.post(url=f'{tortoise_unload_endpoint}', json='')
        response.raise_for_status()
        reload_model()
    ...

And then just call give_VRAM_priority('TTS') every time you want to generate audio.

da3dsoul commented 1 year ago

OK, time to get this thing installed. First problem: running the tortoise install scored me a tortoise folder containing nothing but a hidden .git folder. No requirements.txt, nothing.

I'm pretty dumb about this stuff, which makes me a useful tester, I think. Why shouldn't the user just git clone https://github.com/neonbjb/tortoise-tts ?

March me through this like the perfect idiot I am. I'm sure I'll stumble upon every single unexpected barrier to installation and running the thing. But I want to see this work, so I'll stick with it.

It uses a sparse checkout to skip the unnecessary parts. I had issues getting it to work on Windows. I'm not sure why it wouldn't work; needs looking into. I'm making settings UI for stuff, and Brawlence posted that nice bit of kit for me to look at, too, so I'll definitely be working on stuff today.

If you'd like to just get it running, the first line of install_tortoise.sh can be run without the double-hyphenated options and it should do a full clone.

St33lMouse commented 1 year ago

Note: all this is with the conda environment active.

OK, yeah, that script gives me trouble. Did a straight up git clone:

git clone -n https://github.com/neonbjb/tortoise-tts tortoise but this fails to do the clone because of the rename, I think.

So now: git clone https://github.com/neonbjb/tortoise-tts, rename tortoise-tts to tortoise, cd tortoise, python -m pip install -r ./requirements.txt

This runs through the install, but fails with: "note: This error originates from a subprocess, and is likely not a problem with pip. error: metadata-generation-failed"

Tried to run anyway, loaded chat mode and the tortoise_tts extension, reloaded the gui:

To create a public link, set share=True in launch().
Closing server running on port: 7860
Loading the extension "gallery"... Ok.
Loading the extension "tortoise_tts"... Fail.
Traceback (most recent call last):
  File "/media/mouse/eastwatch/text-generation-webui/modules/extensions.py", line 19, in load_extensions
    exec(f"import extensions.{name}.script")
  File "", line 1, in <module>
  File "/media/mouse/eastwatch/text-generation-webui/extensions/tortoise_tts/script.py", line 8, in <module>
    from .tortoise.tortoise import api
  File "/media/mouse/eastwatch/text-generation-webui/extensions/tortoise_tts/tortoise/tortoise/api.py", line 9, in <module>
    import progressbar
ModuleNotFoundError: No module named 'progressbar'
Running on local URL: http://127.0.0.1:7860

Frustrating!

Went back to the tortoise directory: pip install progressbar, run server.py again. This time it wants rotary_embedding_torch. OK. pip install and run server.py again. gui loads in default mode. Switch to chat and select the extension, but gui won't load. No error messages in the terminal.

Finally I reloaded server.py, opened the gui and switched to chat mode without the extension. no problem. switched to chat mode with the extension, gui won't load, no errors in the terminal.

My guess is that tortoise had a bad install way back at the top when it complained about meta generation failed. I fear renaming the folder tortoise was not a good thing, but I'm not sure. This ends my first attempt. I'm going to try and install tortoise in the extensions folder using the tortoise installation instructions and see if I can avoid errors that way.

St33lMouse commented 1 year ago

Round two: I installed tortoise and got it to work by itself. Had to downgrade numpy and install a few more packages that didn't install themselves for whatever reason. I still got the meta data error, but I was able to run the tortoise script and got it to generate some voiced text for me. I'm getting better at installing this stubborn ai stuff...

Having got tortoise to work by itself, I renamed the tortoise-tts folder to just tortoise so your script could find it, then ran server.py, switched to chat mode and loaded your extension. No luck--gui won't connect.

I went back to tortoise to make sure it would generate a clip with its directory renamed to just tortoise to make sure it worked, and it worked fine. So no problem there, although I'm still uncertain about that rename. Could be asking for trouble.

The only thing I've got left is to figure out what that metadata error was all about, but tortoise works in spite of that error, so I'm pretty much at a dead end with tortoise tts. May try the MRQ version shortly.

I notice that coqui extension has no install script, and requirements.txt simply says tts in it. Should I try to git clone coqui in that folder?

da3dsoul commented 1 year ago

Coqui only needs pip install TTS, as it's on pip.

Try the tortoise clone with

git clone https://github.com/neonbjb/tortoise-tts.git tortoise
cd tortoise
pip install -r ./requirements.txt
St33lMouse commented 1 year ago

OK, that worked this time. No idea why it failed last time. I note that tortoise has another step: python setup.py install which I went ahead and ran from inside tortoise. It then wanted a few more dependencies, easily installed. I tested a voice generation with tortoise by itself, and it worked.

Trying to load your extension in the gui fails, and trying to load it with python server.py --chat --extensions tortoise_tts

gets this error (and then the gui loads anyway)

(testgen) mouse@mousehole:/media/mouse/eastwatch/experiment/text-generation-webui$ python server.py --chat --extensions tortoise_tts
bin /home/mouse/anaconda3/envs/testgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
Loading the extension "tortoise_tts"... Fail.
Traceback (most recent call last):
  File "/media/mouse/eastwatch/experiment/text-generation-webui/modules/extensions.py", line 19, in load_extensions
    exec(f"import extensions.{name}.script")
  File "", line 1, in <module>
  File "/media/mouse/eastwatch/experiment/text-generation-webui/extensions/tortoise_tts/script.py", line 94, in <module>
    model, voice_samples, conditioning_latents = load_model()
  File "/media/mouse/eastwatch/experiment/text-generation-webui/extensions/tortoise_tts/script.py", line 87, in load_model
    tts = api.TextToSpeech(models_dir=os.path.join(models_dir, 'tortoise'), device=params['device'])
  File "/media/mouse/eastwatch/experiment/text-generation-webui/extensions/tortoise_tts/tortoise/tortoise/api.py", line 231, in __init__
    self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir)))
  File "/home/mouse/.local/lib/python3.10/site-packages/torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/mouse/.local/lib/python3.10/site-packages/torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/mouse/.local/lib/python3.10/site-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'models/tortoise/autoregressive.pth'

Ph0rk0z commented 1 year ago

It's supposed to download that stuff the first time you run it. Did you load the extension from the command line? The first time, you have to.

St33lMouse commented 1 year ago

You mean this here: python server.py --chat --extensions tortoise_tts will pause and download that stuff? My internet is kind of sticky today, so it must have timed out. I'll try again.

Ph0rk0z commented 1 year ago

Yea.. and from the UI it won't.

da3dsoul commented 1 year ago

I'll look at it. @St33lMouse the difference in the commands is -n, which means no checkout. You don't want to run setup.py install, mostly because that will install it to the conda environment, and it might cause package conflicts if you wanted to try each of the versions of tortoise side by side. The extension loads tortoise from the folder it cloned, rather than via an installed package.

St33lMouse commented 1 year ago

If I have to do another reinstall I'll avoid the setup.py mistake. OK, got coqui interface to load. I loaded the example character and was surprised to hear a voice start talking to me right away!

I don't see any folders or support files or anything. How do you change voices and stuff? Speaker and language boxes in the UI are empty.

It's getting late for me, so I'll be back in another 10 hours or so and try and get tortoise going. We should pay attention to my little installation problems here and spell things out clearly for other users. After months of installing AI things like this I have some idea about what to do, and I still run into lots of problems. Others will be completely clueless. Even things you think are obvious are not at all obvious to newbies, and they'll either bug you with questions or just give up and go away.

da3dsoul commented 1 year ago

All of it needs more UI work. You can use the settings in the Python file and presumably the built in settings that I haven't figured out yet. I've been trying to figure out the gradio UI framework callbacks

St33lMouse commented 1 year ago

I'm still getting this error when I try to load the tortoise extension from server.py from the terminal:

FileNotFoundError: [Errno 2] No such file or directory: 'models/tortoise/autoregressive.pth'

The error comes up very quickly. I don't think it's trying to download any files. How can I get that missing stuff downloaded?

da3dsoul commented 1 year ago

I'm not sure. I've not been able to reproduce it yet

Ph0rk0z commented 1 year ago

Try to run tortoise manually with stuff from its repo. That will fix the error. See, I used tortoise beforehand and those files all downloaded.

St33lMouse commented 1 year ago

OK...today's installation efforts...success! I finally realized that when it was asking for those files at models/tortoise/autoregressive.pth, it actually meant it wanted a tortoise folder in the text-generation-webui/models folder, NOT in the tortoise installation. I don't see how those files could get downloaded automatically, so I dredged them out of my MRQ installation and put them there. That actually worked! But this folder structure is going to give other people problems, just like me. At the very least, they'll have to be told where to put those files.

So now the extension loaded, and it...worked! Slow as a tortoise, of course. The initial assistant greeting is long and took 400 seconds to generate. I set it to ultra fast and it took 2 minutes to generate a 14 second clip. Are you sure the speed settings are working?

I like how you print the text and also show the sound file. Personally, I want to see that text.

Memory usage is heavy. I have pyg6B loaded, and tortoise pushes me close to the edge. Nothing you can do about that, I figure. My plan is to use alpaca 13B 4bit, which uses about 8gb vram, I think, that should be a good balance and room for safety.

Voices could use a refresh button in case the user drops more voices in the folder during a session. Probably not a high priority thing.

Next up, I'm going to try and get tortoise fast and MRQ working. I'm not sure you need them both. As a matter of fact, you might just want to go with tortoise fast and drop the other two. MRQ's purpose is voice training, though, and a trained MRQ voice is pretty cool. I'll explore this more carefully.

St33lMouse commented 1 year ago

Dropping a new voice into the voices folder does not update the list of possible voices, I think. And surprisingly, throwing out the files in an existing voice and replacing them with your own also does not change it. This is the whole point of Tortoise, so it's worth fixing, I'd say. Maybe this would work in MRQ right off the bat. I'll have to check.