oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Coqui_TTS needs TTS updating or it will keep downloading the model. Also sounds Strange (FIX) #4723

Closed · erew123 closed this issue 7 months ago

erew123 commented 8 months ago

Describe the bug

When Coqui_TTS starts up, it will download the model, saying it has an update, every time it loads the model into memory, resulting in a 1.8 GB download each time. (They just updated the model and TTS about two hours ago.)

It needs a "pip install --upgrade tts" to bump it to v0.21.1

EDIT - The manual workaround, along with the fix for the strange-sounding audio, is here: https://github.com/oobabooga/text-generation-webui/issues/4723#issuecomment-1826120220 (for anyone looking for it)

Is there an existing issue for this?

Reproduction

Load the Coqui_TTS tts_models/multilingual/multi-dataset/xtts_v2 model into memory on an older version of the TTS engine, e.g. v0.20.6.

Screenshot

N/A

Logs

N/A

System Info

N/A
HolgerBr65 commented 8 months ago

I have the same problem. Is there any solution?

erew123 commented 8 months ago

EDIT - The manual workaround, along with the fix for the strange-sounding audio, is here: https://github.com/oobabooga/text-generation-webui/issues/4723#issuecomment-1826120220

Run the CMD_yourOS file in the text-generation-webui folder to put you into your Python environment.

Then run pip install --upgrade tts. (Once you have done this, it will only download the new model once.)

BUT... the new model which it downloads (https://huggingface.co/coqui/XTTS-v2/tree/main) doesn't sound right.

A couple of us are talking about it here https://github.com/coqui-ai/TTS/discussions/3301#discussioncomment-7662825

I think I'll update TTS (as above) but download the old model, and probably block Python from going out through my firewall (for now) to stop it updating to the new model.

The old model is here: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2

HolgerBr65 commented 8 months ago

What a mess. I will follow your idea and see how it turns out. I hope someone can fix this. Just running it all with the internet turned off, or blocking Python through the firewall, seems messy to me as well. Thank you anyway for your help here.

erew123 commented 8 months ago

Just had time to look at this again. The manual workaround is:

1) Open a command prompt/terminal
2) Navigate to your text-generation-webui folder
3) Run the correct CMD file for your OS to start the Python environment up: https://github.com/oobabooga/text-generation-webui#running-commands
4) Run pip install --upgrade tts

The update to TTS will probably get captured by the main update https://github.com/oobabooga/text-generation-webui#getting-updates when the Coqui_TTS extension requirements file is updated, but for now, you can do the above.

The next time the Coqui_TTS extension is loaded, it will probably download the model one last time (or it may not), but let it complete if it does download.

Though quite a few people seem to think the 2.0.3 model sounds strange, so to drop back to the 2.0.2 model:

5) Download model.pth and vocab.json from the 2.0.2 model: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2
6) Find where the tts_models--multilingual--multi-dataset--xtts_v2 folder is on your computer. On Windows this is: C:\Users\YOUR-USER-ACCOUNT\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2 The Linux location is probably: /home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
7) Copy those two files over the top of the files in there. Do not change/delete or touch any other files in there!

Then it starts to generate audio like the previous version did (without the strange accent) and it doesn't demand you download the model each time! :)
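For anyone who would rather script steps 5-7, here is a minimal sketch, assuming the default TTS cache locations noted above; the download folder is a placeholder, so adjust both paths to your setup:

```python
import shutil
import sys
from pathlib import Path

MODEL_DIR = "tts_models--multilingual--multi-dataset--xtts_v2"

def tts_cache_dir() -> Path:
    # Default TTS model cache locations, as noted in step 6 above.
    if sys.platform == "win32":
        return Path.home() / "AppData" / "Local" / "tts" / MODEL_DIR
    return Path.home() / ".local" / "share" / "tts" / MODEL_DIR

def restore_v202(downloads: Path) -> None:
    # Overwrite only model.pth and vocab.json; leave every other file alone.
    dest = tts_cache_dir()
    for name in ("model.pth", "vocab.json"):
        shutil.copy2(downloads / name, dest / name)
        print(f"Copied {name} -> {dest}")

# Hypothetical folder holding the manually downloaded 2.0.2 files.
restore_v202(Path("~/Downloads/xtts_v2.0.2").expanduser())
```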

erew123 commented 8 months ago

For the Devs of the Coqui_TTS Extension

After a bit of discussion on the Coqui_TTS forum, it looks like you can use --model_path and --config_path to run any model you choose or download manually.

My discussion here https://github.com/coqui-ai/TTS/discussions/3301#discussioncomment-7664165

If people don't like the sound of the new 2.0.3 model, it may therefore be best if the Coqui_TTS extension downloads the 2.0.2 model on first run (checking for its existence on later runs) and then uses model_path and config_path to load the model.

This would mean that we could version-control future model releases, I guess. The 2.0.2 model is hosted here: https://huggingface.co/coqui/XTTS-v2/tree/v2.0.2

So the below would need to change to use model_path and config_path:

"model_name": "tts_models/multilingual/multi-dataset/xtts_v2",

and

from TTS.api import TTS

def load_model():
    # Loading by model_name triggers the online version check/download.
    model = TTS(params["model_name"]).to(params["device"])
    return model
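As an illustration of that suggestion, a minimal sketch: pin the 2.0.2 files to a local folder, fetch them only if missing, and load via model_path/config_path so no update check runs. The folder path is a placeholder, and params is the extension's existing settings dict:

```python
from pathlib import Path

import requests
from TTS.api import TTS

# Hypothetical pinned-model location inside the extension folder.
MODEL_DIR = Path("extensions/coqui_tts/models/xtts_v2.0.2")
BASE_URL = "https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2"

def ensure_model() -> None:
    MODEL_DIR.mkdir(parents=True, exist_ok=True)
    for name in ("model.pth", "config.json", "vocab.json"):
        target = MODEL_DIR / name
        if target.exists():
            continue  # only download on first run
        with requests.get(f"{BASE_URL}/{name}", stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(target, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)

def load_model():
    ensure_model()
    # model_path/config_path bypass the model_name lookup, so TTS never
    # phones home to check for a newer model version.
    model = TTS(model_path=str(MODEL_DIR),
                config_path=str(MODEL_DIR / "config.json")).to(params["device"])
    return model
```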

Please also see my suggested updated script.py for low-VRAM handling: https://github.com/oobabooga/text-generation-webui/issues/4712#issuecomment-1825593734

q5sys commented 8 months ago

> 6. (not sure on the Linux location)

/home/${USER}/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2 (at least on RHEL-based distros)

erew123 commented 7 months ago

Closing this ticket off. I assume anyone who searches will find it in the closed section.

testter21 commented 7 months ago

Any idea how to set this in XTTS in the Pinokio browser (macOS)? Apparently XTTS will not start without an internet connection, and it will always update to the most recent model, which is wrong.

erew123 commented 7 months ago

@testter21 you can download the updated version that I built if you like. I've tested it for you, and as long as you have run it with an internet connection once, it then works offline with the "API Local" and "XTTSv2 Local" methods. "API TTS" may well need an internet connection, so don't click on that one if offline!

I've not yet managed to separate it off into its own simple download, so you would have to follow these instructions.

You would need to download these files into your existing /extensions/coqui_tts/ folder: config.json, modeldownload.json, modeldownload.py, script.py, requirements.txt, tts_server.py

from here: https://github.com/erew123/text-generation-webui/tree/main/extensions/coqui_tts

Then create a subfolder called templates, e.g. /extensions/coqui_tts/templates, and download generate_form.html into it,

from here: https://github.com/erew123/text-generation-webui/tree/main/extensions/coqui_tts/templates

To download each file individually, click on each file one by one, then click the "download raw file" button. [screenshot] This version will store the model file separately from the TTS Python service, but that does mean it WILL download a file on first start-up. It may well tell you to run "pip install --upgrade tts" at your command prompt. I've made this version verbose at the command prompt for things like that. [screenshot]

There is a full manual and settings page included in this one. You will see the link on the settings interface! [screenshots]

testter21 commented 7 months ago

@erew123, quick question.

With oobabooga, as I'm not a programmer, I followed these instructions: https://www.youtube.com/watch?v=lZkQUOpLg6g adjusting to macOS, as this fellow refers to Windows.

Theoretically all went well until, at some point (6'35" in the video), I ended up with an error:

...
Closing server running on port: 7860
2023-12-08 10:37:38 INFO:Loading the extension "gallery"...
2023-12-08 10:37:38 INFO:Loading the extension "coqui_tts"...
[XTTS] Loading XTTS...
tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
Using model: xtts
2023-12-08 10:37:56 ERROR:Failed to load the extension "coqui_tts".
...

Thus, the TTS GUI part is not being included. Will your suggestion fix that problem too?

*

As for the Pinokio browser, the XTTS folder/data structure is different there.

erew123 commented 7 months ago

I've only ever seen one other person with that issue, which strangely I responded to on here: https://github.com/oobabooga/text-generation-webui/issues/4718

From your error message it's not too clear why it didn't load on your system. Perhaps TTS isn't properly installed.

If you download the one I suggested, then once you have it, go into the extensions/coqui_tts folder at the command prompt/terminal (on a Mac, I guess terminal) and type

pip install -r requirements.txt

That should ensure that all the necessary files are installed... in theory!

Beyond that, when the version I sent you starts up, it performs a few checks and should warn you if there is something missing/wrong (please see the screenshot in my last post and the "warning" message it gave there). So it's more likely to tell you what to do if something else needs doing.

testter21 commented 7 months ago

That person was not me. I'm guessing very few people attempt to install this stuff, as it usually requires either some programming skills or the ability to navigate file/folder structures, and some logic.

I did not post more info on this error because I need to reproduce the steps again to see where something could go wrong. Technically, oobabooga says that it makes an isolated install (thus dependencies are stored in its own folder), but during the installation steps I did have some... well, I'm not sure if this was an error or an ambiguity issue, but it said something related to macwhisper (now, I don't know if this is included in oobabooga as scripts, but I do have MacWhisper installed as an app; if the installer interfered with an external app, then this is not an isolated install as stated on the webpage).

For now, I'm a bit lost with all this, but I will follow your steps and see where it leads.

testter21 commented 7 months ago

Okay, I'm reinstalling oobabooga.

So I unpacked text-generation-webui-main, started start_macos.sh, and selected the vendor (Apple M in this case). After that, during installation, I get something like this:

...
Downloading werkzeug-3.0.1-py3-none-any.whl (226 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 226.7/226.7 kB 2.8 MB/s eta 0:00:00
Installing collected packages: Werkzeug, sniffio, itsdangerous, click, blinker, tiktoken, Flask, anyio, starlette, flask_cloudflared, sse-starlette
  Attempting uninstall: tiktoken
    Found existing installation: tiktoken 0.3.3
    Uninstalling tiktoken-0.3.3:
      Successfully uninstalled tiktoken-0.3.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openai-whisper 20230918 requires tiktoken==0.3.3, but you have tiktoken 0.5.2 which is incompatible.
...

It's the same error I had previously, and I don't know if it's related to the fact that the TTS GUI part doesn't load later.

and later:

...


WARNING: Skipping torch-grammar as it is not installed.
Uninstalled torch-grammar
Requirement already satisfied: ...

and then at the end again:

...
  Uninstalling starlette-0.33.0:
    Successfully uninstalled starlette-0.33.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openai-whisper 20230918 requires tiktoken==0.3.3, but you have tiktoken 0.5.2 which is incompatible.
Successfully installed GitPython-3.1.40 ...

and at the end of course this:


erew123 commented 7 months ago

That's not related to the Coqui_tts extension itself. It looks like you aren't using the default Python environment that text-generation-webui sets up... though, not having a Mac, I've never tried it on one, but I assume it to be similar to Linux.

My suggestion would be, to get yourself to a stable/known position.

1) Download the one-click installer into a new folder (https://github.com/oobabooga/text-generation-webui#one-click-installers) and run start_macos.sh

This will re-download and set up the whole of text-generation-webui and build a new Python environment for it.

2) Every time you want to run it, run start_macos.sh; that ensures it loads into the correct Python environment with all the correct requirements. If you DON'T run it this way, you will load up a different Python environment that could have any settings in it... this may even be your current problem!

If you have an issue beyond that I would report it on the "issues" page on here, as that would be a general issue with the installer for it.

testter21 commented 7 months ago

> 1) Download the one-click installer into a new folder (https://github.com/oobabooga/text-generation-webui#one-click-installers) and run start_macos.sh

I used exactly this one and started start_macos.sh as in the description. I redownloaded just in case, but got exactly the same errors.

erew123 commented 7 months ago

If you've done that, gone through the whole install, and the next time you try running it with start_macos.sh it's giving errors, I'd post on the issues page here: https://github.com/oobabooga/text-generation-webui/issues

Either oobabooga or someone with a Mac will take a look. You'll need to post the text output of the issue you're getting when you run start_macos.sh, as that should tell them what's wrong.

It sounds like it could be a requirements-file issue, maybe, but they would have to take a look at that (someone who works on the core code and has access to a Mac). You can also hunt that issues page for others with the same problem (also look for closed issues).

testter21 commented 7 months ago

I reposted my notes from here. I'll see what happens if I add your steps on a clean install, and whether it includes the TTS GUI or not.

BTW, do such installs work like Windows portable apps, i.e. can they be migrated between folders or computers?

erew123 commented 7 months ago

The things in the extensions folder should move fine. As for moving the whole of text-generation-webui... umm... possibly, though you might have to run the setup_YOUROSHERE file. I've never tried, so I can only say hypothetically.

testter21 commented 7 months ago

In essence, it would be good if you could relatively freely move the whole thing around and separately back up/reuse subfolders. I guess some config file would then be needed to point to the starting path and inclusions if new stuff is added. I have no idea whether this works as I just described; I'm guessing it would be a reasonable course of action.

btw, as for pinokio browser, I just got enigmatic response on coqui discord:

Me: How to disable automatic model downloads in XTTS in the Pinokio browser (macOS)? The most recent model is buggy; I'd like to keep an earlier iteration of 2.x.
Reply: tts = TTS("xtts_v2.0.2", gpu=True)
Me: In which file?
Reply: It's for if you're calling TTS programmatically. We don't have anything yet for the CLI.

Whatever that means.

erew123 commented 7 months ago

Updating to TTS 0.21.3 stops the continuous download occurring (hence the warning in the version I built, if you look at the screenshot, along with the command-line instruction for performing the update).

My version also allows you to use both the 2.0.2 and the latest (2.0.3) versions of the model simultaneously (as the 2.0.3 model sounds bad). All of that is detailed in the settings page after it's installed and up and running.

If you just want to stop it re-downloading the model all the time, you need to be in the text-generation-webui Python environment. So start text-generation-webui at a command prompt with start_macos.sh, then Ctrl+C to exit it (it should have loaded the environment), then:

pip install --upgrade tts

[screenshot]

testter21 commented 7 months ago

So far.

  1. Although with some errors mentioned above, oobabooga is installed, Apple M series selected.
  2. The 6 files from coqui_tts are downloaded and placed in the correct folder (some old files are replaced, so no issue here); the template file/folder is handled too.
  3. pip install -r extensions/coqui_tts/requirements.txt handled.
  4. oobabooga started via start_macos.sh; when enabling coqui_tts in the session, after applying, the TTS files (models and so on) were downloaded.

and then:

...
[CoquiTTS Startup] DeepSpeed Not Detected. See https://github.com/microsoft/DeepSpeed
[CoquiTTS Model] XTTSv2 Local Loading xttsv2_2.0.2 into cpu
[CoquiTTS Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 60 seconds maximum
[CoquiTTS Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 60 seconds maximum
ERROR: Traceback (most recent call last):
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/tts_server.py", line 46, in startup_shutdown
    await setup()
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/tts_server.py", line 110, in setup
    model = await xtts_manual_load_model()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/tts_server.py", line 154, in xtts_manual_load_model
    model.cuda()
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 918, in cuda
    return self._apply(lambda t: t.cuda(device))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 918, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

ERROR: Application startup failed. Exiting.
[CoquiTTS Startup] Warning TTS Subprocess has NOT started up yet, Will keep trying for 60 seconds maximum
... [repeats multiple times] ...
[CoquiTTS Startup] Startup timed out. Check the server logs for more information.
2023-12-08 13:35:51 ERROR:Failed to load the extension "coqui_tts".
Traceback (most recent call last):
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/modules/extensions.py", line 36, in load_extensions
    exec(f"import extensions.{name}.script")
  File "<string>", line 1, in <module>
  File "/Users/xxxx/Downloads/oobabooga/text-generation-webui-main/extensions/coqui_tts/script.py", line 206, in <module>
    sys.exit(1)
SystemExit: 1

erew123 commented 7 months ago

Got it! OK, it's looking for CUDA, which you won't have on a Mac, because CUDA is for an Nvidia card. It's something that I can code into a new version... though I'm still working on separating this out into its own install, so as not to impact the supplied Coqui_tts extension (which will have exactly the same issue, as this version is a derivative of it). I'm 40-60% of the way through splitting this out into its own version... and once I have that done and tested I'll see what I can do about the CUDA thing, though I'm guessing there's 2-3 hours of coding involved. I'll let you know when I get it done.
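For reference, a minimal sketch of the kind of device guard that avoids this crash: only pick CUDA when an Nvidia GPU is actually present, falling back to Apple's MPS backend or the CPU. (The function name is illustrative, not the actual AllTalk code.)

```python
import torch

def pick_device() -> str:
    # Only use CUDA when an Nvidia GPU is actually present.
    if torch.cuda.is_available():
        return "cuda"
    # Apple Silicon exposes the MPS backend instead of CUDA (torch >= 1.12).
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# e.g. in tts_server.py, instead of unconditionally calling model.cuda():
# model = model.to(pick_device())
```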

testter21 commented 7 months ago

Ok, thanks! Let me know if anything else is needed.

BTW, there is also this missing TTS GUI part bug. Is this related to oobabooga in general, or to the current situation?

erew123 commented 7 months ago

I can't say. Possibly because the extension didn't load.

testter21 commented 7 months ago

@erew123 - any luck?

erew123 commented 7 months ago

@testter21 Not yet, no. It took me until about 2-3 hours ago to finally build a release based on the main code I was rewriting (problems along the way): https://github.com/erew123/alltalk_tts

I was literally at it (coding + writing documentation) for roughly 14 hours yesterday, so I'm taking a little mental break this evening. I have loosely had a bit of space to think my way around Mac and using system RAM only; now that I've got the main bit out of the way, I'm going to take a little downtime, gather my thoughts, and I'll let you know.

testter21 commented 7 months ago

Ok, have a nice weekend then.

BTW, this is the Pinokio version of XTTS: https://github.com/cocktailpeanut/xtts.pinokio It works on Mac M-series (despite the lack of model version management and a token limit of 400), so maybe it will give some help.

erew123 commented 7 months ago

@testter21 You're welcome to give it a go: https://github.com/erew123/alltalk_tts

I've done what I can to put logic into the code to ensure it won't attempt to load into an Nvidia graphics card when a system doesn't have one. Unfortunately it's not as simple as a couple of changes, due to the amount of logic already going on inside the scripts.

But, as I say, I've done what I can. I've also separated out the requirements files, so you will be installing requirements_other.txt.

All the instructions are written up on the front page of the above link.

I guess it only remains to say, I don't have a Mac, so I haven't tested it out on one. In theory it should work, but I can't say for certain that you won't encounter some error.

testter21 commented 7 months ago

OK, thanks for the effort, I will check it and see where it goes. I'm determined to have this running here, so in any case I can help with testing on Mac.

BTW, what's (in this case) the relation of Coqui_TTS to Bark voice cloning? Is one built on top of the other, or are these two separate workflows?

testter21 commented 7 months ago

Okay, so this is the workflow on Mac, and the things you will see on a typical machine: workflow.zip

It seems to work; nice job there. In the preview box, when I type something in my language, generation works (initially I had the wrong output selected on the device).

And now the big question: how can I use it as a regular TTS generator with downloadable output, without bothering with the chat? (So far, I don't see any multilingual local models anyway that would be suitable for reasonable data processing.)

How will it handle longer chunks of text? This is a recent observation: in the Pinokio browser, I noticed that there is a token limit (I don't know what it refers to: phonemes or words). I'm not sure whether text there is processed 'sentence by sentence' or 'as is'. Technically, if the TTS itself has an input token limit, sentences could be split (primarily at full stops, secondarily at commas) in preprocessing.

testter21 commented 7 months ago

Hm... so basically it would have to be something like a dummy that sends user input to the output window (a text pass-through).

erew123 commented 7 months ago

Thanks for the documents. That at least confirms to me it does what it's supposed to do! It looks like you may have decent speed on generating voice samples too, despite being on a CPU/RAM setup.

Your first question: Coqui, Bark and XTTS. Coqui are an open-source company/foundation making text-to-speech engines and that kind of thing: https://coqui.ai/about. They manage the core TTS code and the various models/methods for generating text to speech. Other companies and people can submit code or models to them, hence you will see other people's/companies' names flying around all over the place on their site.

Bark is one of their models/methods for voice cloning (https://tts.readthedocs.io/en/latest/models/bark.html), which I believe requires a larger voice sample and "training" on those voice samples, so it has a higher memory requirement and can't just clone voices without creating a trained RVC file ahead of time. It may, however, be able to generate more accurate-sounding results (unsure).

XTTSv2 is another model/method for voice cloning (https://tts.readthedocs.io/en/latest/models/xtts.html) that doesn't require training ahead of time; it just requires a 6-12 second voice sample and generates on the fly.

So effectively, they are just two different ways/methods of doing voice cloning, each with their pros and cons. If you scroll down the left index on the links above, under the "TTS" section, you will see the various other models they manage, e.g. Tortoise etc.

AllTalk is working with the XTTSv2 method.

Generating TTS content. Now that you have the TTS engine installed within your text-generation-webui Python environment, as long as you are loaded into that Python environment you can actually just use TTS in your terminal window (https://tts.readthedocs.io/en/latest/inference.html). Below is an example command line (change only the text, output path, speaker sample, and language for your needs):

tts --text "this is my text I want to hear" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --out_path output.wav --speaker_wav voices/female_01.wav --language_idx en

The problem with that method is that each time you run it, it has to load the model into memory, which adds 15 seconds onto each voice generation.
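As an aside, the same load-once idea can be scripted from Python with the Coqui TTS API; a minimal sketch (the sentence list and file names are just placeholders):

```python
from TTS.api import TTS

# Load the model once (the slow ~15 s step), then reuse it per sentence.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

lines = ["First sentence to speak.", "Second sentence."]
for i, line in enumerate(lines):
    tts.tts_to_file(
        text=line,
        speaker_wav="voices/female_01.wav",  # a 6-12 second voice sample
        language="en",
        file_path=f"output_{i:03d}.wav",     # numbered for later merging
    )
```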

The other way you could do it is to run my script, which will load a model into memory and keep it there. It can be controlled via CURL commands (from a web page, if you wanted to make one, or from the command line). I have included the format of the CURL commands in the documentation built into AllTalk, so have a look there for the commands to use! (Pretty sure curl will be installed already on a Mac; if not, it's a tiny thing.)
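For a rough idea of what driving such a server looks like from Python, here is a sketch; the endpoint, port, and field names are placeholders, so check the AllTalk documentation for the real ones:

```python
import requests

# Placeholder endpoint and fields - consult the AllTalk docs for the
# actual route names and parameters.
resp = requests.post(
    "http://127.0.0.1:7851/api/tts-generate",
    data={
        "text_input": "This is my text I want to hear.",
        "language": "en",
        "output_file_name": "myoutput",
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # e.g. the path to the generated wav
```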

Technically speaking, if you wanted to, you could build a custom Python environment that is purely for TTS, with no need to have text-generation-webui and its requirements installed, and either use the command-line TTS as I showed, OR run my script from its folder ("python script.py") and it would load into memory (though I have a change/update to make before it will run as a standalone, so you would need to update AllTalk again when I've made that change, if you wanted to go down that route).

Longer Sentences. There are two generation methods the XTTS and TTS engines use for generating text. I list these in my interface as "API" and "XTTSv2" (where you can select the model/method). The XTTS method requires that the model is loaded into memory at all times, and you can throw as much text at it as you like in one long paragraph. *Note below on this.

The other method, like I showed you above ("tts --text "this is my text I w..."), or "API" as I call it, is called model-to-file, and will split very long lines into individual sentences for each bit it generates (aka processed 'sentence by sentence') before combining them into one wav file. So it may be better for longer text generation.

Long sentences/paragraphs are a problem. The reason is that when it is generating the speech, it has to look at the sample wav file/voice you provided and try to keep the generation on track so it sounds correct. The longer the generation you are making, the harder it gets to keep on track, and the voice starts to waver and sound strange. So typically there is a limit, hence you may have to split long paragraphs into individual sentences, so that each sentence generates nicely. I believe the "API" method will always split into sentences automatically. For the XTTSv2 method, I had to write code to clean up the text, split it into sentences, and do other such things, though that code won't be used if you are sending text to my engine via CURL commands.
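To illustrate the kind of splitting described above, a minimal sketch (not AllTalk's actual cleanup code; the 250-character cap echoes the limit mentioned later in this thread):

```python
import re

def split_for_tts(text: str, max_chars: int = 250) -> list[str]:
    # First split on sentence-ending punctuation...
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for s in sentences:
        # ...then break over-long sentences at a comma where possible.
        while len(s) > max_chars:
            cut = s.rfind(",", 0, max_chars)
            if cut == -1:
                cut = max_chars
            chunks.append(s[:cut + 1].strip())
            s = s[cut + 1:].strip()
        if s:
            chunks.append(s)
    return chunks
```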

It's a very long and complicated thing, and I'm no expert on it all, to be honest, but hopefully some of the above answers some of your questions!

testter21 commented 7 months ago

Thanks for detailed info, it clarifies some things. During following days, I will see how it works.

BTW, so currently MacBook M-series GPUs are not supported by XTTS in any way?

erew123 commented 7 months ago

I have no real clue, to be honest. I did a quick search and found this from a year ago: https://github.com/coqui-ai/TTS/discussions/2208

You may want to post the question there, in their discussions forum, as they are the people who will 100% know.

testter21 commented 7 months ago

> tts --text "this is my text I want to hear" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --out_path output.wav --speaker_wav voices/female_01.wav --language_idx en
>
> The problem with that method is that each time you run it, it has to load the model into memory, which adds 15 seconds onto each voice generation.

As of right now, I tested this part and it works. In the meantime, we tried to run Bark with the project owner within another repo, but for now it doesn't work, at least not on Mac M-series, so that one will have to wait.

On a side note here, it seems that the text has to be pre-filtered for some characters (like "-"), otherwise it will not work.

> The other way you could do it is to run my script, which will load a model into memory and keep it there. It can be controlled via CURL commands (from a web page, if you wanted to make one, or from the command line). I have included the format of the CURL commands in the documentation built into AllTalk, so have a look there for the commands to use! (Pretty sure curl will be installed already on a Mac; if not, it's a tiny thing.)

> Technically speaking, if you wanted to, you could build a custom Python environment that is purely for TTS, with no need to have text-generation-webui and its requirements installed, and either use the command-line TTS as I showed, OR run my script from its folder ("python script.py") and it would load into memory (though I have a change/update to make before it will run as a standalone, so you would need to update AllTalk again when I've made that change, if you wanted to go down that route).

I would appreciate an update so that this can be used via a GUI.

> Long sentences/paragraphs are a problem. The reason is that when it is generating the speech, it has to look at the sample wav file/voice you provided and try to keep the generation on track so it sounds correct. The longer the generation you are making, the harder it gets to keep on track, and the voice starts to waver and sound strange. So typically there is a limit, hence you may have to split long paragraphs into individual sentences, so that each sentence generates nicely. I believe the "API" method will always split into sentences automatically. For the XTTSv2 method, I had to write code to clean up the text, split it into sentences, and do other such things, though that code won't be used if you are sending text to my engine via CURL commands.

I noticed that the token limit probably relates to single sentences, not the whole text in a paragraph. Technically, there is a way around too-long sentences: by simply preparing and modifying the text first. For audiobook production, text preparation is not that big an issue, and I guess it could be automated to some degree. The first step would be to flag too-long sentences by token count. Then these can be re-edited manually (or split at commas and other separators). Long ago, when I was using TTS to make spoken content, at the end, what didn't sound right had to be resynthesized and repasted. This can be easily done in apps like Reaper.

erew123 commented 7 months ago

On the API calls, I am not currently performing any filtering of the text you input. So yes, clearing out special characters etc. is best done to avoid strange sounds or issues. I may in future provide a separate API call that will give you the option to push any text sent to it through the filter I use for the AI models. It's just not been a core focus at the moment.
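As an illustration of that kind of pre-filtering (a sketch; AllTalk's actual filter may differ), based on the characters reported as troublesome in this thread ("-" and runs of dots):

```python
import re

def sanitize_for_tts(text: str) -> str:
    # Characters reported in this thread to trip up XTTS.
    text = text.replace("-", " ")
    text = re.sub(r"\.{3,}", ". ", text)          # "..." -> plain full stop
    text = re.sub(r"[*_#<>\[\]{}|~^]", "", text)  # strip markup leftovers
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace
```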

I was discussing this with someone here, today https://github.com/erew123/alltalk_tts/issues/3

> I would appreciate an update so that this can be used via a GUI.

What do you mean? A text box you can type in and press a "generate audio" button?

> Token limit

A few days ago, I added automatic sentence splitting to the "XTTSv2 Local" method. Obviously you would have to update to have that enabled. Though, saying that, I'm still not sure what the actual limit is there... I mentioned this to the person I was talking with above. You may want to check that for a few more details on the API (as far as I have gotten).

erew123 commented 7 months ago

If it was a web page, this now exists: [screenshot]

If it is other methods for playing, I'm working on a separate API that will have all the options, though that probably won't be done for a few days.

testter21 commented 7 months ago

My apologies for the late reply; flu season.

In the meantime, I did some proof-of-concept experiments with XTTS in the Pinokio implementation, as I figured out which lines to mute in which files so that I can use the model version I want. Apparently this works.

My tests there showed the following, as I pushed a 40-minute-long lecture through it. It's possible that some of these thoughts are valid here too. (I will check your update in a few days.)

  1. I had to prepare the text manually, i.e. I had to shorten some sentences which simply broke the generation altogether. So when I saw that a sentence was very long (these lectures came from the spoken word, so...), I looked for a neighbouring comma to split it at. Automation here would probably count approximate tokens (characters?) per sentence and decide whether and where to split. Maybe a general split at commas rather than full stops is not a bad idea either.
  2. I had to split the generation into chunks (separate generations). The 40-minute-long script breaks (stops) the generation somewhere in the middle, or it's related somehow to the above. But since it's time consuming, it was better to do that. If feeding the model too many phrases is indeed problematic, then maybe the solution is to reload the model from time to time for a fresh start.
  3. From what I see, per 40-minute generation there are approx. 30 marked points where a phrase has to be regenerated. The model (all versions?) tends to produce trash data, like half-wordings in foreign languages or human-like huffing and puffing. On a side note, and this is good: regenerating the same phrase produces different articulation each time (old TTS methods had a fixed sound for these).

In other words, making audiobooks with cloned-voice text generation is possible, but still a bit time consuming. The results, though, are decent and satisfactory.

What I'm wondering for this model: are there any modifiers for mood, pitch, speed, accent (etc.) control?

erew123 commented 7 months ago

Hope you are feeling better. Flu is never good.

Speed, yes: https://docs.coqui.ai/en/dev/models/xtts.html#inference-parameters The others, not currently, though I understand they are planned.

As for pushing other things through it, it's had about another 2-3 updates since last week, and there is now a full API suite for it: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-api-suite-and-json-curl

I think you said you are on a Mac, for which, as I understand it (though I could be wrong), DeepSpeed isn't available. That's a shame, as it would give you a 2-3x speed increase.

Typically, trash data/noises (in my experience) are mostly caused by anything that isn't basic punctuation, and even then sometimes by basic punctuation like ! ? etc. Generally, filtering these out gets better results (the new API supports some levels of those filtering methods, though you just can't avoid everything).

Obviously, it would be reasonably simple to have some kind of script that could split sentences by whatever method (periods etc.) and just keep firing chunks of text at the TTS. Combining it afterwards is the other question; see the sketch below. There are methods to iterate through and do things like that, especially if you number your files, making it easy to do in the correct order. (The API supports file naming.)
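For the combining step, a minimal sketch that concatenates numbered wav files with Python's standard wave module (it assumes all clips share the same sample rate and format, which output from a single model run does):

```python
import wave
from pathlib import Path

def merge_wavs(pattern: str, out_path: str) -> None:
    # Zero-padded numbered names (output_000.wav, ...) sort correctly.
    files = sorted(Path(".").glob(pattern))
    with wave.open(out_path, "wb") as out:
        for i, f in enumerate(files):
            with wave.open(str(f), "rb") as clip:
                if i == 0:
                    out.setparams(clip.getparams())  # copy rate/width/channels
                out.writeframes(clip.readframes(clip.getnframes()))

merge_wavs("output_*.wav", "combined.wav")
```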

I also did a vocabulary update at some point that further improves the quality of the speech output, but I can't recall precisely when that was done. Some time in the last 10 days.

You may also find that finetuning a voice improves the quality of the output, however, you would need a system with an Nvidia card for that.

testter21 commented 6 months ago

Yes, I'm accessing this on Mac ARM.

As for trash data, a simple experiment narrows this problem down: just paste the same short sentence multiple times. At some point you will notice strange audio artifacts, in this case in foreign languages, from what I saw. So this issue is independent of punctuation.

As for finetuning, I don't see any practical reasons for it, at least not at this point. Adding more training data may (or may not?) make the TTS speech sound more similar to a person's voice, but the same is true for giving XTTS different short voice samples of the same person. So it's a matter of finding samples that translate well to the cloning. On the other hand, the currently trained model is in some cases better than the original person in terms of articulation. So it's a good trade-off, I think.

As for audiobook creation, I did a proof of concept to see how this can be made with some visual content, and how much time it takes. Audio can be adjusted in Reaper, a great tool for this kind of stuff. Spoken text has to be translated into subtitles; the easiest way seems to be MacWhisper, which re-transcribes the spoken text and creates so-so synced subtitles. Subtitles can be re-processed in DaVinci Resolve (this seems to be the only app that handles subtitle editing for audiobooks well), which is also a great video editor. So this is a manageable and relatively pleasant workflow, although it takes a few hours in total per 30 minutes of data (if you want quality material).

erew123 commented 6 months ago

@testter21 I think this is what you are after...

[screenshot]

It's almost finished.

q5sys commented 6 months ago

> Long sentences/paragraphs are a problem. The reason is that when it is generating the speech, it has to look at the sample wav file/voice you provided and try to keep the generation on track so it sounds correct. The longer the generation you are making, the harder it gets to keep on track, and the voice starts to waver and sound strange.

Is this in any way related to why the first few seconds always sound bad? Because I run into that all the time: the first 2 seconds or so sound like the voice is drunk, then it gets really clean for a while before getting drunk again at the end of long sentences. When using the 'demo' option in oobabooga, I've gotten to the point of padding what I need audio for with about 3-4 words at the beginning and end, and just clipping them off in Audacity later.

I'll definitely be checking out your work on alltalk_tts. That looks like a very nice, simplified way to get the audio I need.

erew123 commented 6 months ago

Multiple things can cause bad audio. Certain characters slipping through, e.g. three dots "..." (causes it to make an oohwowowhhh kind of sound). So I've done a lot in AllTalk to try to filter things like this.

Your audio sample can also cause some of these effects. So if you have AllTalk and you try "female_01.wav", the model was finetuned on that, so you will notice it's very unlikely to produce any strange sounds with it (as long as you don't have "..." etc. slipping through). So good-quality samples are important.

AllTalk includes the option to finetune models if you want.

Other than that, you do just get the occasional bad sound, and you aren't meant to ask the AI to produce lines longer than 250 characters at a time. In AllTalk I have enabled sentence splitting, but you can still get the odd thing slipping through. This is why I've made a long-form generator that will split things into shorter productions that you can merge together at the end... though I've not published it yet. It will be part of AllTalk soon.

q5sys commented 6 months ago

@erew123 Thanks for the response. I'll be following your work on AllTalk.

testter21 commented 6 months ago

@erew123, sorry for the late reply, but the beginning of the year is always a busy time.

In the photo, what does "chunk sizes" refer to? (Because it's not a sentence or paragraph, from what I see in this example.)

BTW, optional (checkbox) splitting of text by "sentences" (full stop, question mark, exclamation mark, semicolon, colon) seems to be a good idea too. This way, the source text outside this can be indexed separately in a completely different workflow (like translation setups, which follow sentence-by-sentence routines well), and if the numbering is the same, then there would be a match between them. I said optional because I'm not sure if XTTS would handle this as well as longer chunks (timbral and flow continuity).

So far, as a test, I made two "visual audiobooks" using XTTS ( link1 ), and the major time-consuming task is regenerating faulty sentences. So listening to sentences one by one, with one-click regeneration of the wrong ones, seems to be a good idea. Then, I guess, it would be nice if the playback list was configurable (checkbox single/continuous) to continue from the last clicked segment.

Also, and this would be helpful for making visual audiobooks: it might be useful to export the text in SRT format, synced according to the audio segments' lengths. I'm not yet sure if I'm correct with this one, but synced editing (shifting of synced audio and text blocks) in DaVinci Resolve would do the rest. Re-splitting of long text sentences from the SRT source can be done in DaVinci, as this has to be synced manually anyway (DaVinci doesn't transcribe in many languages, and MacWhisper does a so-so job, not that great).
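To illustrate the SRT idea, a minimal sketch with hypothetical inputs: a list of (sentence text, generated wav file) pairs in playback order, with timestamps derived from each wav's duration:

```python
import wave

def srt_time(seconds: float) -> str:
    # Format seconds as the SRT timestamp HH:MM:SS,mmm.
    total = int(seconds)
    ms = int(round((seconds - total) * 1000))
    return f"{total // 3600:02}:{total % 3600 // 60:02}:{total % 60:02},{ms:03}"

def wav_duration(path: str) -> float:
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def export_srt(segments: list[tuple[str, str]], out_path: str) -> None:
    t = 0.0
    with open(out_path, "w", encoding="utf-8") as f:
        for i, (text, wav_path) in enumerate(segments, start=1):
            start, t = t, t + wav_duration(wav_path)
            f.write(f"{i}\n{srt_time(start)} --> {srt_time(t)}\n{text}\n\n")
```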

Now let's see the updates in your project.

erew123 commented 6 months ago

@testter21 "Chunk sizes" refers to sentences. It is not split by colon as that would break how the AI looks at pronouncing the TTS. Splitting by anything other than standard punctuation that forms an entire sentence will cause the AI to deviate from its pronunciation, hence only splitting by entire sentences (1, 2, 3 etc).

Obviously you can choose where to play back from: [screenshot] I can look at a checkbox to stop it resetting the playback location at some point.

This:

> Also, and this would be helpful for making visual audiobooks: it might be useful to export the text in SRT format, synced according to the audio segments' lengths. I'm not yet sure if I'm correct with this one, but synced editing (shifting of synced audio and text blocks) in DaVinci Resolve would do the rest. Re-splitting of long text sentences from the SRT source can be done in DaVinci, as this has to be synced manually anyway (DaVinci doesn't transcribe in many languages, and MacWhisper does a so-so job, not that great).

sounds very complicated and not something I specifically know about. I can make an extra export option (assuming it's not too complex, because it could be). Would you have a better example of what you mean, or some research on this?