oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

A possible 4x-5x performance increase in coqui_tts processing on low-memory cards. #4712

Closed. erew123 closed this 11 months ago

erew123 commented 1 year ago

EDIT: there's now a whole new build of this: https://github.com/erew123/alltalk_tts

Description

When you are running a model that takes up all of the VRAM on your card, coqui_tts can be horribly slow (the CPU is almost as fast in this scenario). I've found a potential way to speed it up massively, though, not being a coder, I'm not sure how (or if) it can be implemented, and I'll admit I'm not 100% sure what's going on here, as there are quirks to this.

TL;DR: It looks as though there may be a massive performance benefit in low-VRAM scenarios from reloading the coqui_tts engine for longer sentences/paragraphs, if other people are seeing the issue I am describing.

I suspect this won't benefit people who have plenty of VRAM spare (after loading their model), only those who are in a low-VRAM scenario.

I have included some thoughts and strange observations about this at the bottom.

Additional Context

Here is a 13B model loaded onto an RTX 4070, which takes about 11.4GB to 11.7GB of VRAM. You can also see, in the shared GPU memory, that the coqui_tts model is loaded, taking a couple of GB of RAM.

(screenshot)

When you generate TTS with coqui_tts, I believe layers of the AI model and the coqui_tts model are being swapped around and processed. This results in a horrible time to generate the TTS: minutes as opposed to seconds. I suspect that, rather than loading the entire coqui_tts model back into the GPU as one large chunk, it may be nibbling away at the model, loading a bit into GPU VRAM, processing a little, then nibbling a bit more, and so on, rather than shifting out some of the AI model's layers and moving the whole TTS model into VRAM.

As shown below, this is taking 250 seconds and 456 seconds to generate the TTS.

(screenshot)

If I load the coqui_tts engine in another command prompt within the same environment, what it appears to do is boot a couple of GB of the AI model out to shared GPU memory, load the coqui_tts model fully onto the card, then process your TTS. It takes maybe 10-20 seconds to load the coqui_tts model, which is NOT displayed as part of the "Processing Time", so add 10-20 seconds onto that time.

The performance increase for doing this, however, is HUGE! Load time for the XTTSv2 model + processing is maybe 40 seconds total vs 250+ seconds! (Yes, the AI model is still loaded in my VRAM, with a bit moved to shared RAM.)

(screenshot)

It looks like it loads in the TTS model, booting some of the AI model's layers out to shared GPU RAM, then, after processing the TTS, seemingly unloads the TTS model again (in this scenario, where I loaded it at another command prompt in the same environment).

(screenshot)

When you go back and start using the chat, it shuffles the model layers back into your GPU and there is a 10-20% degradation in tokens per second (rough guess), but it's much better than waiting 2-3 minutes for your audio to be generated.

What I am suggesting is that, for people whose model fills up their VRAM entirely, being able to reload the coqui_tts model for longer sentences may provide a massive overall performance increase.

Thoughts/observations.

Here is the command line I used, should others wish to test. You will need to load the correct Python environment and change the path to reflect your own wav file location.

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "Emma Watson, who portrayed Hermione Granger in the Harry Potter films, was born on April 15, 1990 in Paris, France. She began her acting career at a young age, appearing in television shows and movies before landing the role that would make her a household name. In addition to her work on Harry Potter, Watson has gone on to star in numerous other films, including The Perks of Being a Wallflower (2012) and Little Women (2019). She is also known for her activism and advocacy work, particularly regarding gender equality and women's rights." --speaker_wav c:\AI\Female.wav --language_idx en --use_cuda true
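
For illustration, here is a minimal sketch of what "loading the TTS engine in a separate process" amounts to, assuming the tts CLI above is available in the active environment; the helper name and default output path are placeholders, not part of the extension:

# Hypothetical helper: run the Coqui tts CLI in its own process so the whole
# XTTSv2 model is (re)loaded into VRAM for this one generation, then released.
import subprocess

def generate_tts_in_subprocess(text, speaker_wav, out_path="output.wav", language="en"):
    cmd = [
        "tts",
        "--model_name", "tts_models/multilingual/multi-dataset/xtts_v2",
        "--text", text,
        "--speaker_wav", speaker_wav,
        "--language_idx", language,
        "--out_path", out_path,
        "--use_cuda", "true",
    ]
    # Each call pays the 10-20 second model-load cost, but the model gets a
    # contiguous block of VRAM instead of being swapped in piecemeal.
    subprocess.run(cmd, check=True)
    return out_path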

erew123 commented 1 year ago

@kanttouchthis Please see above. You may be interested in this.

berkut1 commented 1 year ago

Just to be sure, did you do this: https://github.com/oobabooga/text-generation-webui/discussions/4484 ? Thanks, Nvidia: now everything related to VRAM needs clarifying as to whether System Fallback is disabled or not :)

erew123 commented 1 year ago

I have tested that and will show it below; it was worth a try/look, but I suspected it wouldn't help.

I guess the short explanation of what's occurring is that the AI LLM model has VRAM pre-allocated, let's say 11GB out of 12GB (it never de-allocates that VRAM). The TTS engine/model needs 2.5GB of VRAM, but there is only 1GB of VRAM free. So, rather than freeing up 2.5GB of VRAM and loading the TTS engine/model fully into VRAM, it uses chunks of the TTS engine/model and swaps them from disk/RAM into VRAM, meaning it can only access bits of the TTS model at a time and has to keep swapping bits around. This slows down processing, of course. It's further compounded when you get a long paragraph, because the TTS engine breaks it down into chunks/sentences for the TTS model to generate, meaning multiple passes over the TTS model occur for long paragraphs/individual sentences.

What my method does is spin up a whole new process and load in the TTS engine/model. This forces some layers of the AI model out of VRAM, allowing the entire TTS engine/model to be loaded in one chunk, which speeds up processing massively, albeit at the cost of having to load in the model each time (a 10-15 second delay). I did try re-loading the TTS model/engine within script.py, but that appears not to force the AI LLM layers out; it just re-loads the TTS model back into the same place it was before.

CUDA - Sysmem Fallback Policy set to Prefer No Sysmem Fallback: 252 seconds to generate 200 tokens' worth of TTS. (screenshot)

Using the method I put in the earlier code for low-memory situations: 60 seconds to generate 200 tokens' worth of TTS (assuming a 15-second load time for the TTS model). (screenshot)

NB: This is only for situations where your AI LLM model is taking up virtually all your VRAM and there is less than 2.5GB of VRAM free. If I load a 7B model instead of a 13B model, I have plenty of VRAM free and can keep the TTS engine/model loaded in VRAM without issue.
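
For anyone scripting this decision, a rough way to check whether you are in that situation is to query free device memory before loading the TTS model. A minimal sketch follows; the 2.5GB threshold mirrors the figure above and is a ballpark assumption, not a measured requirement:

import torch

def should_use_low_vram(threshold_gb=2.5):
    # True if the GPU doesn't have enough free VRAM to hold the XTTSv2 model
    # comfortably, in which case the reload-per-request approach is likely faster.
    if not torch.cuda.is_available():
        return False
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / (1024 ** 3) < threshold_gb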

Thanks for the suggestion though! Worth a try :)

berkut1 commented 1 year ago

@erew123 I think you didn't disable Sysmem Fallback correctly. text-generation-webui uses the Python from the folder oobabooga_windows\text-generation-webui\installer_files\env, but you are showing a different path :) Unless you installed the entire text-generation-webui into the AppData folder.

erew123 commented 1 year ago

@berkut1 Good spot! Just tried it, forcing no system fallback. Sadly, it won't even load the extension now, so it looks like that can't be used to handle the memory allocation in the way we want :/

(screenshot)

berkut1 commented 1 year ago

@erew123 You can. Just allocate fewer GPU layers so the extension can be loaded, and I'm fairly sure you'll get a 5x performance boost for TTS without reinventing the wheel ;)

erew123 commented 1 year ago

@berkut1 Do you mean loading fewer layers into VRAM by specifying it in the loader (n-gpu-layers)? If so, the ExLlama loaders don't have that capability, at least not through the interface.

erew123 commented 1 year ago

EDIT: a full new build is available: https://github.com/erew123/alltalk_tts

I've re-written a decent chunk of this (scripts down below). New features:

  1. Start-up checks.
  2. Different models/methods: you can now switch model types on the fly (takes 10-15 seconds to change between them).
  3. Low VRAM option for people with less than 2GB of free VRAM after loading the LLM (takes 10-15 seconds to change the setting).
  4. Activate DeepSpeed (takes 10-15 seconds to change the setting; example screenshot down below).

Linux:
cd text-generation-webui
./cmd_linux.sh
export CUDA_HOME=/usr/local/cuda
python server.py

Windows:
cd text-generation-webui
activate your 3.9.18 python environment (I will provide instructions at a later date on creating one)
set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1 (or 11.8)
python server.py

NOTE: DeepSpeed on Windows has certain requirements. You can follow this link for instructions/details: https://github.com/oobabooga/text-generation-webui/issues/4734
NOTE: DeepSpeed on Linux also requires you to install libaio-dev.
NOTE: You can run ds_report once you have DeepSpeed on your system to see if it is working correctly.
NOTE: DeepSpeed WILL throw up a load of errors if you have not installed it fully and correctly, or have not installed the Nvidia CUDA Toolkit and correctly set the CUDA_HOME environment variable. Loading text-generation-webui via its start-up scripts (cmd_xxxxx or start_xxxxx) will over-write the CUDA_HOME environment variable, AND on Windows (only Windows) the start-up scripts load a newer version of Python than DeepSpeed for Windows supports. So on Windows, follow https://github.com/oobabooga/text-generation-webui/issues/4734 (it can be a complicated setup on Windows).
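
As a quick pre-flight check before launching server.py, something like the following (run inside the same Python environment) confirms that CUDA_HOME is set and that DeepSpeed can at least be imported; it is only an illustration, and ds_report remains the proper diagnostic:

import os

# Illustrative sanity check for the DeepSpeed notes above.
cuda_home = os.environ.get("CUDA_HOME")
print("CUDA_HOME:", cuda_home or "NOT SET - set it before launching server.py")

try:
    import deepspeed
    print("DeepSpeed version:", deepspeed.__version__)
except ImportError as err:
    print("DeepSpeed not importable:", err)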

  5. The command line output is a bit more verbose now, as there's a lot more going on in the backend. As such, I included more output to help people understand what's going on and debug any issues.

Other than that, it pretty much all works as it did before. I'm not claiming to be an amazing coder, so I'm sure some things could be tighter. I can think of a few other things that would be good to do later down the line with this.

Thanks erew123

Would also like to thank @oobabooga @kanttouchthis (original code) @daswer123 (VRAM<>RAM moving code) @Wuzzooy (for testing my scripts)

(screenshots)

These scripts can be copied and saved into your /extensions/coqui_tts/ folder.

script.py

import html
import json
import random
import time
import subprocess
import os
from pathlib import Path

import gradio as gr

from modules import chat, shared, ui_chat
from modules.logging_colors import logger
from modules.ui import create_refresh_button
from modules.utils import gradio

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

try:
    from TTS.api import TTS
    from TTS.utils.synthesizer import Synthesizer
except ModuleNotFoundError:
    logger.error(
        "[COQUI TTS] Could not find the TTS module. Make sure to install the requirements for the coqui_tts extension."
        "\n"
        "\nLinux / Mac:\npip install -r extensions/coqui_tts/requirements.txt\n"
        "\nWindows:\npip install -r extensions\\coqui_tts\\requirements.txt\n"
        "\n"
        "[COQUI TTS] If you used the one-click installer, paste the command above in the terminal window launched after running the \"cmd_\" script. On Windows, that's \"cmd_windows.bat\"."
    )
    raise

try:
    import deepspeed
except Exception:
    deepspeed_installed = False
    print("[COQUI TTS] DEEPSPEED: Not Detected. See https://github.com/microsoft/DeepSpeed") 
else: 
    deepspeed_installed = True
    print("[COQUI TTS] DEEPSPEED: Detected")
    print("[COQUI TTS] DEEPSPEED: Activate in Coqui settings")

params = {
    "activate": True,
    "autoplay": True,
    "show_text": True,
    "low_vram": False,
    "remove_trailing_dots": False,
    "voice": "female_01.wav",
    "language": "English",
    "model_name": "tts_models/multilingual/multi-dataset/xtts_v2", #This is the version called through the TTS.api
    "model_version": "xttsv2_2.0.2", #This is the model that is downloaded into your /coqui_tts/models/
    "deepspeed_activate": False,
    # Set the different methods for Generation of TTS
    "tts_method_api_tts": False,
    "tts_method_api_local": False,
    "tts_method_xtts_local": True,
    "model_loaded": False
}

#Clear model
model = None

# Device Setup
device = "cuda" if torch.cuda.is_available() else "cpu"

this_dir = Path(__file__).parent.resolve()

with open(this_dir / 'languages.json', encoding='utf8') as f:
    languages = json.load(f)

# Check for model having been downloaded
def check_required_files():
    this_dir = Path(__file__).parent.resolve()
    download_script_path = this_dir / 'modeldownload.py'
    subprocess.run(['python', str(download_script_path)])
    print("[COQUI TTS] STARTUP: All required files are present.")

# Call the function when your main script loads
check_required_files()

# Pick model loader depending on Params
def setup():
    global model
    generate_start_time = time.time()  # Record the start time of loading the model
    if params["tts_method_api_tts"]:
        print(f"[COQUI TTS] MODEL: \033[94mAPI TTS Loading\033[0m {params['model_name']} into\033[93m", device, "\033[0m")
        model = api_load_model()
    elif params["tts_method_api_local"]:
        print(f"[COQUI TTS] MODEL: \033[94mAPI Local Loading\033[0m {params['model_version']} into\033[93m", device, "\033[0m")
        model = api_manual_load_model()
    elif params["tts_method_xtts_local"]:
        print(f"[COQUI TTS] MODEL: \033[94mXTTSv2 Local Loading\033[0m {params['model_version']} into\033[93m", device, "\033[0m")
        model = xtts_manual_load_model()

    generate_end_time = time.time()  # Record the end time of loading the model
    generate_elapsed_time = generate_end_time - generate_start_time
    params["model_loaded"] = True
    print(f"[COQUI TTS] MODEL: \033[94mModel Loaded in \033[0m{generate_elapsed_time:.2f} seconds.")
    Path(f"{this_dir}/outputs").mkdir(parents=True, exist_ok=True)

#Model Loaders
def api_load_model():
    model = TTS(params["model_name"]).to(device)
    return model

def api_manual_load_model():
    model = TTS(model_path=this_dir / 'models' / params['model_version'],config_path=this_dir / 'models' / params['model_version'] / 'config.json').to(device)
    return model

def xtts_manual_load_model():
    config = XttsConfig()
    config_path = this_dir / 'models' / params['model_version'] / 'config.json'
    checkpoint_dir = this_dir / 'models' / params['model_version']
    config.load_json(str(config_path))
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir=str(checkpoint_dir), use_deepspeed=params['deepspeed_activate'])
    model.to(device)  # move to CUDA if available, otherwise stay on CPU
    return model

#Unload or clear the model, and return None
def unload_model(model):
    del model
    params["model_loaded"] = False
    return None

#Move model between VRAM and RAM if Low VRAM set.
def switch_device():
    global model, device
    if not params["low_vram"]:
        return
    if device == "cuda":
        device = "cpu"
        model.to(device) 
        torch.cuda.empty_cache()
    else:
        device = "cuda"
        model.to(device)

#Display license information
print("[COQUI TTS] LICENSE: \033[94mCoqui Public Model License\033[0m")
print("[COQUI TTS] LICENSE: \033[94mhttps://coqui.ai/cpml.txt\033[0m")

def get_available_voices():
    return sorted([voice.name for voice in Path(f"{this_dir}/voices").glob("*.wav")])

def preprocess(raw_input):
    raw_input = html.unescape(raw_input)
    # raw_input = raw_input.strip("\"")
    return raw_input

def new_split_into_sentences(self, text):
    sentences = self.seg.segment(text)
    if params['remove_trailing_dots']:
        sentences_without_dots = []
        for sentence in sentences:
            if sentence.endswith('.') and not sentence.endswith('...'):
                sentence = sentence[:-1]

            sentences_without_dots.append(sentence)

        return sentences_without_dots
    else:
        return sentences

Synthesizer.split_into_sentences = new_split_into_sentences

def remove_tts_from_history(history):
    for i, entry in enumerate(history['internal']):
        history['visible'][i] = [history['visible'][i][0], entry[1]]

    return history

def toggle_text_in_history(history):
    for i, entry in enumerate(history['visible']):
        visible_reply = entry[1]
        if visible_reply.startswith('<audio'):
            if params['show_text']:
                reply = history['internal'][i][1]
                history['visible'][i] = [history['visible'][i][0], f"{visible_reply.split('</audio>')[0]}</audio>\n\n{reply}"]
            else:
                history['visible'][i] = [history['visible'][i][0], f"{visible_reply.split('</audio>')[0]}</audio>"]
    return history

def random_sentence():
    with open(Path("extensions/coqui_tts/harvard_sentences.txt")) as f:
        return random.choice(list(f))

#Preview Voice Generation Function
def voice_preview(string):
    #Check model is loaded before continuing
    if not params["model_loaded"]:
        print("[COQUI TTS] \033[91mWARNING\033[0m Model is still loading, please wait before trying to generate TTS")
        return
    string = html.unescape(string) or random_sentence()
    # Replace double quotes with single, asterisks, carriage returns, and line feeds
    string = string.replace('"', "'").replace(".'", "'.").replace('*', '').replace('\r', '').replace('\n', '')
    output_file = Path('extensions/coqui_tts/outputs/voice_preview.wav')
    if params["low_vram"] and device == "cpu":
        switch_device()
        print("[COQUI TTS] LOW VRAM: Moving model to:\033[93m", device, "\033[0m")

    #XTTSv2 LOCAL Method
    if params["tts_method_xtts_local"]:     
        generate_start_time = time.time()  # Record the start time of generating TTS
        print("[COQUI TTS] GENERATING TTS: {}".format(string))
        gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[f"{this_dir}/voices/{params['voice']}"])
        out = model.inference(
            string,
            languages[params["language"]],
            gpt_cond_latent=gpt_cond_latent,
            speaker_embedding=speaker_embedding,
            temperature=0.7
        )
        torchaudio.save(output_file, torch.tensor(out["wav"]).unsqueeze(0), 24000)
        generate_end_time = time.time()  # Record the end time to generate TTS
        generate_elapsed_time = generate_end_time - generate_start_time
        print(f"[COQUI TTS] PROCESSING TIME: \033[91m{generate_elapsed_time:.2f}\033[0m seconds.")
    #API TTS and API LOCAL Methods
    elif params["tts_method_api_tts"] or params["tts_method_api_local"]:
        #Set the correct output path (different from the if statement)
        model.tts_to_file(
            text=string,
            file_path=output_file,
            speaker_wav=[f"{this_dir}/voices/{params['voice']}"],
            language=languages[params["language"]]
        )       

    if params["low_vram"] and device == "cuda":
        switch_device()
        print("[COQUI TTS] LOW VRAM: Moving model to:\033[93m", device, "\033[0m")

    return f'<audio src="file/{output_file.as_posix()}?{int(time.time())}" controls autoplay></audio>'

def history_modifier(history):
    # Remove autoplay from the last reply
    if len(history['internal']) > 0:
        history['visible'][-1] = [
            history['visible'][-1][0],
            history['visible'][-1][1].replace('controls autoplay>', 'controls>')
        ]

    return history

def state_modifier(state):
    if not params['activate']:
        return state

    state['stream'] = False
    return state

def input_modifier(string, state):
    if not params['activate']:
        return string

    shared.processing_message = "*Is recording a voice message...*"
    return string

#Standard Voice Generation Function
def output_modifier(string, state):
    if not params["model_loaded"]:
        print("[COQUI TTS] \033[91mWARNING\033[0m Model is still loading, please wait before trying to generate TTS")
        return
    if params["low_vram"] and device == "cpu":
        switch_device()
        print("[COQUI TTS] LOW VRAM: Moving model to:\033[93m", device, "\033[0m")
    if not params['activate']: 
        return string

    original_string = string
    string = preprocess(html.unescape(string))

    if string == '':
        return '*Empty string*'

    # Replace double quotes with single, asterisks, carriage returns, and line feeds
    string = string.replace('"', "'").replace(".'", "'.").replace('*', '').replace('\r', '').replace('\n', '')  
    output_file = Path(f'extensions/coqui_tts/outputs/{state["character_menu"]}_{int(time.time())}.wav')

    #XTTSv2 LOCAL Method
    if params["tts_method_xtts_local"]:     
        generate_start_time = time.time()  # Record the start time of generating TTS
        print("[COQUI TTS] GENERATING TTS: {}".format(string))
        gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[f"{this_dir}/voices/{params['voice']}"])
        out = model.inference(
            string,
            languages[params["language"]],
            gpt_cond_latent=gpt_cond_latent,
            speaker_embedding=speaker_embedding,
            temperature=0.7
        )
        torchaudio.save(output_file, torch.tensor(out["wav"]).unsqueeze(0), 24000)
        generate_end_time = time.time()  # Record the end time to generate TTS
        generate_elapsed_time = generate_end_time - generate_start_time
        print(f"[COQUI TTS] PROCESSING TIME: \033[91m{generate_elapsed_time:.2f}\033[0m seconds.") 

        autoplay = 'autoplay' if params['autoplay'] else ''
        string = f'<audio src="file/{output_file.as_posix()}" controls {autoplay}></audio>'

        if params['show_text']:
            string += f'\n\n{original_string}'

        shared.processing_message = "*Is typing...*"

    #API TTS and API LOCAL Methods    
    elif params["tts_method_api_tts"] or params["tts_method_api_local"]:
        #Set the correct output path (different from the if statement)
        model.tts_to_file(
            text=string,
            file_path=output_file,
            speaker_wav=[f"{this_dir}/voices/{params['voice']}"],
            language=languages[params["language"]]
        )       

        autoplay = 'autoplay' if params['autoplay'] else ''
        string = f'<audio src="file/{output_file.as_posix()}" controls {autoplay}></audio>'

        if params['show_text']:
            string += f'\n\n{original_string}'

        shared.processing_message = "*Is typing...*"

    if params["low_vram"] and device == "cuda":
        switch_device()
        print("[COQUI TTS] LOW VRAM: Moving model to:\033[93m", device, "\033[0m")

    return string

def custom_css():
    path_to_css = Path(f"{this_dir}/style.css")
    return path_to_css.read_text()

#Low VRAM Gradio Checkbox handling
def handle_low_vram(value):
    global model, device
    if value:
        model = unload_model(model)
        device = "cpu"
        print("[COQUI TTS] MODEL: \033[94mChanging model \033[92m(Please wait 15 seconds)\033[0m")
        print("[COQUI TTS] LOW VRAM: \033[94mEnabled.\033[0m Model will move between \033[93mVRAM(cuda) <> System RAM(cpu)\033[0m")
        setup()
    else:
        model = unload_model(model)
        device = "cuda"
        print("[COQUI TTS] MODEL: \033[94mChanging model \033[92m(Please wait 15 seconds)\033[0m")
        print("[COQUI TTS] LOW VRAM: \033[94mDisabled.\033[0m Model will remain in \033[93mVRAM\033[0m")
        setup()

#Reload the model when DeepSpeed checkbox is enabled/disabled
def handle_deepspeed_activate_checkbox_change(value):
    global model

    if value:
        # DeepSpeed enabled
        print("[COQUI TTS] DEEPSPEED: \033[93mActivating)\033[0m")
        print("[COQUI TTS] MODEL: \033[94mChanging model \033[92m(Please wait 15 seconds)\033[0m")
        model = unload_model(model)
        params["tts_method_api_tts"] = False
        params["tts_method_api_local"] = False
        params["tts_method_xtts_local"] = True
        params["deepspeed_activate"] = True
        gr.update(tts_radio_buttons={"value": "XTTSv2 Local"})
        setup()
    else:
        # DeepSpeed disabled
        print("[COQUI TTS] DEEPSPEED: \033[93mDe-Activating)\033[0m")
        print("[COQUI TTS] MODEL: \033[94mChanging model \033[92m(Please wait 15 seconds)\033[0m")
        params["deepspeed_activate"] = False 
        model = unload_model(model)
        setup()

    return value # Return new checkbox value

# Allow DeepSpeed Checkbox to appear if tts_method_xtts_local and deepspeed_installed are True
deepspeed_condition = params["tts_method_xtts_local"] and deepspeed_installed

def handle_tts_method_change(choice):
    # Update the params dictionary based on the selected radio button
    print("[COQUI TTS] MODEL: \033[94mChanging model \033[92m(Please wait 15 seconds)\033[0m")

    # Set other parameters to False
    if choice == "API TTS":
        params["tts_method_api_local"] = False
        params["tts_method_xtts_local"] = False
        params["tts_method_api_tts"] = True
        params["deepspeed_activate"] = False
        gr.update(deepspeed_checkbox={"value": False})
    elif choice == "API Local":
        params["tts_method_api_tts"] = False
        params["tts_method_xtts_local"] = False
        params["tts_method_api_local"] = True
        params["deepspeed_activate"] = False
        gr.update(deepspeed_checkbox={"value": False})
    elif choice == "XTTSv2 Local":
        params["tts_method_api_tts"] = False
        params["tts_method_api_local"] = False
        params["tts_method_xtts_local"] = True

    # Unload the current model
    global model
    model = unload_model(model)

    # Load the correct model based on the updated params
    setup()

def ui():
    with gr.Accordion("Coqui TTS (XTTSv2)"):
        with gr.Row():
            activate = gr.Checkbox(value=params['activate'], label='Activate TTS')
            autoplay = gr.Checkbox(value=params['autoplay'], label='Play TTS automatically')

        with gr.Row():
            show_text = gr.Checkbox(value=params['show_text'], label='Show message text under audio player')
            remove_trailing_dots = gr.Checkbox(value=params['remove_trailing_dots'], label='Remove trailing "." from text segments before generation')

        with gr.Row():
            low_vram = gr.Checkbox(value=params['low_vram'], label='Low VRAM mode (Read NOTE)')
            deepspeed_checkbox = gr.Checkbox(value=params['deepspeed_activate'], label='Activate DeepSpeed (Read NOTE)', visible=deepspeed_installed)

        with gr.Row():
            tts_radio_buttons = gr.Radio(
            choices=["API TTS", "API Local", "XTTSv2 Local"],
            label="Select TTS Generation Method (Read NOTE)",
            value="XTTSv2 Local"  # Set the default value
            )

            explanation_text = gr.HTML("<p>NOTE: Switching Model Type, Low VRAM & DeepSpeed takes 15 seconds. Each TTS generation method has a slightly different sound. DeepSpeed checkbox is only visible if DeepSpeed is present on your system and it only uses XTTSv2 Local.</p>")

        with gr.Row():
            with gr.Row():
                voice = gr.Dropdown(get_available_voices(), label="Voice wav", value=params["voice"])
                create_refresh_button(voice, lambda: None, lambda: {'choices': get_available_voices(), 'value': params["voice"]}, 'refresh-button')

            language = gr.Dropdown(languages.keys(), label="Language", value=params["language"])

        with gr.Row():
            preview_text = gr.Text(show_label=False, placeholder="Preview text", elem_id="silero_preview_text")
            preview_play = gr.Button("Preview")
            preview_audio = gr.HTML(visible=False)

        with gr.Row():
            convert = gr.Button('Permanently replace audios with the message texts')
            convert_cancel = gr.Button('Cancel', visible=False)
            convert_confirm = gr.Button('Confirm (cannot be undone)', variant="stop", visible=False)

    # Convert history with confirmation
    convert_arr = [convert_confirm, convert, convert_cancel]
    convert.click(lambda: [gr.update(visible=True), gr.update(visible=False), gr.update(visible=True)], None, convert_arr)
    convert_confirm.click(
        lambda: [gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)], None, convert_arr).then(
        remove_tts_from_history, gradio('history'), gradio('history')).then(
        chat.save_history, gradio('history', 'unique_id', 'character_menu', 'mode'), None).then(
        chat.redraw_html, gradio(ui_chat.reload_arr), gradio('display'))

    convert_cancel.click(lambda: [gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)], None, convert_arr)

    # Toggle message text in history
    show_text.change(
        lambda x: params.update({"show_text": x}), show_text, None).then(
        toggle_text_in_history, gradio('history'), gradio('history')).then(
        chat.save_history, gradio('history', 'unique_id', 'character_menu', 'mode'), None).then(
        chat.redraw_html, gradio(ui_chat.reload_arr), gradio('display'))

    # Event functions to update the parameters in the backend
    activate.change(lambda x: params.update({"activate": x}), activate, None)
    autoplay.change(lambda x: params.update({"autoplay": x}), autoplay, None)
    low_vram.change(lambda x: params.update({"low_vram": x}), low_vram, None)
    low_vram.change(handle_low_vram, low_vram, None)
    tts_radio_buttons.change(handle_tts_method_change, tts_radio_buttons, None)
    deepspeed_checkbox.change(handle_deepspeed_activate_checkbox_change, deepspeed_checkbox, None)
    remove_trailing_dots.change(lambda x: params.update({"remove_trailing_dots": x}), remove_trailing_dots, None)
    voice.change(lambda x: params.update({"voice": x}), voice, None)
    language.change(lambda x: params.update({"language": x}), language, None)

    # Play preview
    preview_text.submit(voice_preview, preview_text, preview_audio)
    preview_play.click(voice_preview, preview_text, preview_audio)

modeldownload.py

import os
from pathlib import Path
import requests
from tqdm import tqdm
import importlib.metadata as metadata  # Use importlib.metadata
from packaging import version

def create_directory_if_not_exists(directory):
    if not directory.exists():
        directory.mkdir(parents=True)

def download_file(url, destination):
    response = requests.get(url, stream=True)
    total_size_in_bytes = int(response.headers.get('content-length', 0))
    block_size = 1024  # 1 Kibibyte

    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)

    with open(destination, 'wb') as file:
        for data in response.iter_content(block_size):
            progress_bar.update(len(data))
            file.write(data)

    progress_bar.close()

def check_tts_version():
    try:
        tts_version = metadata.version("tts")
        print(f"[COQUI TTS] STARTUP: TTS version: {tts_version}")

        if version.parse(tts_version) < version.parse("0.21.1"):
            print("[COQUI TTS] STARTUP: \033[91mTTS version is too old. Please upgrade to version 0.21.1 or later.\033[0m")
            print("[COQUI TTS] STARTUP: \033[91mpip install --upgrade tts\033[0m")
        else:
            print("[COQUI TTS] STARTUP: TTS version is up to date.")
    except metadata.PackageNotFoundError:
        print("[COQUI TTS] STARTUP: TTS is not installed.")

# Use this_dir in the downloader script
this_dir = Path(__file__).parent.resolve()

# Define paths
base_path = this_dir / 'models'
model_path = base_path / 'xttsv2_2.0.2'

# Check and create directories
create_directory_if_not_exists(base_path)
create_directory_if_not_exists(model_path)

# Define files and their corresponding URLs
files_to_download = {
    'LICENSE.txt': 'https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/LICENSE.txt?download=true',
    'README.md': 'https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/README.md?download=true',
    'config.json': 'https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/config.json?download=true',
    'model.pth': 'https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/model.pth?download=true',
    'vocab.json': 'https://huggingface.co/coqui/XTTS-v2/resolve/v2.0.2/vocab.json?download=true',
}

# Download files if they don't exist
print("[COQUI TTS] STARTUP: Checking Model is Downloaded.")
for filename, url in files_to_download.items():
    destination = model_path / filename
    if not destination.exists():
        print(f"[COQUI TTS] STARTUP: Downloading {filename}...")
        download_file(url, destination)

check_tts_version()

Wuzzooy commented 12 months ago

Thank you for your contribution. Do you know if there is a way to use the DeepSpeed setting with the API module version? In their docs they show it when you load the XTTS module directly, but not the API one. And yes, DeepSpeed can be a pain to install, but it would still be cool to be able to use it if we manage to install it, because the way we use Coqui is not that optimal for text streaming, and so the wait after the text output is not that great, even with the model fully loaded in VRAM.

erew123 commented 12 months ago

@Wuzzooy Hah, well, I was just kind of having a look at that myself yesterday when they updated TTS and the model and everything went a bit funky.

I too can only find the one reference to loading/using a model with DeepSpeed: https://tts.readthedocs.io/en/latest/models/xtts.html#model-directly

In the code above that I dropped in, enabling DeepSpeed for a loaded model should be easy to do, bar actually getting DeepSpeed installed on your system. There are ways to make it easier for Windows installs (never tried, but I believe it's very easy on Linux). And DeepSpeed on Windows does work and speed up TTS (same for Linux).

The TTS executable doesn't appear to have any option to use DeepSpeed, only the Python code. But I have a theory on how to make both my Low VRAM solution above and the "model pre-loaded" method work and use DeepSpeed, whilst still performing their current behaviour. But I'll need to do some head scratching and a bit of research.

erew123 commented 12 months ago

@Wuzzooy I've managed to implement DeepSpeed in my lowvram script (I've just updated it above). I've NOT yet tested that it actually works, as I haven't installed DeepSpeed on my system yet... and I've not had time to check these scripts on Linux yet either, but they should work.

The lowvram script doesn't keep the model in VRAM, but loads it every time it's needed, so you have a 10-15 second delay when using that script anyway, before it starts processing. I also made it download the old TTS model, as the new one isn't very good. In theory though, if I get a bit further down the line, I might (or hopefully one of the devs will) have it working in both scripts. At least there's a potential way of testing it now... a proof of concept!

Wuzzooy commented 12 months ago

Thank you, I will need to do another ooba install to install DeepSpeed and try your script, because while I could make it work on Windows with Coqui standalone, I had to use Python 3.9: with the wheel for 3.10 it was failing with a DLL error when trying to generate audio, and I had an error trying to compile for 3.11. I've seen your tutorial to install it, but if anyone needs it, I got it working with the wheel available for Python 3.9 here: https://huggingface.co/Jmica/audiobook_maker/tree/main It's for CUDA 11.8 though: pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

erew123 commented 12 months ago

I've given it a shot with CUDA 12.3 (which I think is too high) and a new Python env I built on 3.10, and it's trying, but it's throwing errors at me!

FYI, if you are thinking of using my code/scripts up above, DeepSpeed will only try to run when you use the Low VRAM setting. I've not yet done anything with the main script, as I thought I might as well see if I could get it to work on Windows..... and well, getting DeepSpeed to work on Windows, that was 4-5 hours of my life gone!

If you use my lowvram script, it WILL download a 2GB TTS model (into a sub-folder). It will only download the once, but just letting you know!

I'm probably not going to be looking at this further for a few hours as, well, it's late here and bed's calling!

erew123 commented 12 months ago

Just to add (before I go), Text-Gen-WebUI appears to be installing with Python 3.11.5 and I've set mine up with CUDA 12.1 (though you get the option of 11 or 12 at install time).

I guess I may try building some wheels for those versions tomorrow and see if one of them works correctly. The one I've built for now is complaining like crazy, but it does at least see that DeepSpeed is there now on Windows.

I'll also test it on my Linux system... one thing at a time, I guess!

Wuzzooy commented 12 months ago

Well, I installed oobabooga with Python 3.9 and managed to install that DeepSpeed wheel for Python 3.9 / CUDA 11.8, and it worked. I'm getting an error with openai though, the same as reported here: https://github.com/dusty-nv/jetson-containers/issues/338

Anyway, I can't properly compare the inference speed with the low VRAM thing, but it seems to have worked. That said, I have the same issue that I have with Coqui standalone when not using the API module: the 250-character limit pops up for just 3 sentences and the audio gets weird at the end. (screenshot) We need to be able to use it with their API module, but maybe there is a reason they don't provide the option for it :D

erew123 commented 12 months ago

@Wuzzooy I've updated the lowvram.py script to now give you "model load time, tts generation time, total script time".

(screenshot)

Force DeepSpeed on or off in lowvram.py (for now) by editing:

model.load_checkpoint(config, checkpoint_dir=str(checkpoint_dir), use_deepspeed=deepspeed_enabled)

There's no need to restart the interface, as it loads this file each time you tell it to generate; just change deepspeed_enabled to True or False. It will still say it detects/uses DeepSpeed, but if you set that to False, it won't use it.

Test it with the same paragraph in both scenarios (use the preview in the web interface) and see what the output time is like!

Wuzzooy commented 12 months ago

Thank you, I did the comparison.

(screenshot: DeepSpeed comparison)

erew123 commented 12 months ago

So it's made it nearly as fast as the method that keeps it in memory all the time!! Wow! That is a speedup.

I'm still looking into DeepSpeed things. Once I've figured out some more bits, I'll see if it's possible to have the other method, where it keeps the model in memory all the time, use DeepSpeed as an option as well! That would give you the 6-second processing time with an "always loaded TTS" model, based on the above!

erew123 commented 12 months ago

@Wuzzooy If you're interested, I've managed to get DeepSpeed compiling on Windows. Turns out v8.3 of DeepSpeed is the most recent build that will compile for Windows, on Python 3.9.18. Later versions cannot be compiled at all for Windows.

I've not managed to test them both out yet, but if you want them at all, they are here https://easyupload.io/235jwk

There's both CUDA 11.8 and 12.1 (as I say, I've not managed to test them all yet).

Wuzzooy commented 12 months ago

"Turns out v8.3 of DeepSpeed is the most recent build that will compile for Windows, on Python 3.9.18. Later versions cannot be compiled at all for Windows." Thank you, it's what i guessed but couldn't be sure if the issue was on my side, i wasted so much time on this trying to compile with their recent version and python 3.10/11.

erew123 commented 12 months ago

@Wuzzooy Have a look here: https://github.com/microsoft/DeepSpeed/issues/ (currently the 1st issue)... I've had a bit of a go at Microsoft about how messy/bad this is.

erew123 commented 12 months ago

@Wuzzooy I've just moved my script over to a Linux machine and set up DeepSpeed on it with CUDA 12.1. I'm obviously forcing it to use the old 2.0.2 XTTS model. DeepSpeed is doing its thing, so I'm happy to start taking a look at the main script now and see if I can get that to work. However, I'm getting some very strange drop-off on voice generation on Linux. I can't decide if it's a result of DeepSpeed, maybe CUDA 12.1, or just something that's a Linux thing.

So I just wanted to ask you: have you been getting a lot of strange voice generation, specifically on longer paragraphs (maybe 40 words or more), specifically when using DeepSpeed?

Wuzzooy commented 12 months ago

Yes I did, it's what I was talking about in one of my previous posts, but I don't think it's related to DeepSpeed; rather it's the way we use the model and the XTTS limit. I had the same thing when trying the Coqui standalone version via their demo code, not using the API module. The API module seems to handle the generation differently. To make sure, you could just load the model the same way you do with the low VRAM thing, but with DeepSpeed set to False.

Edit: I just tested with your lowvram script with DeepSpeed set to false and I get the audio glitch too on 20+ second audio clips. And I know DeepSpeed was really off, because there was no DeepSpeed verbose output after the generation and the processing time was similar to what I get with the original script. Also, you should see this message in the terminal: "[!] Warning: The text length exceeds the character limit of 250 for language 'en', this might cause truncated audio."

Apparently, we are not splitting the text into sentences when using the code that loads and uses the model directly, so we are sending the whole text at once. The API module, with its synthesizer module, splits the text into sentences and so doesn't have this issue with 30-second audio? Not sure how it works exactly, so some parts of what I'm saying could be wrong, so forgive me for that :D If that's the issue, maybe there is a way to use the model module the same way the API module does, or maybe there is a way to use DeepSpeed with the API module.
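
If that is the cause, one possible workaround (a sketch only, untested here) is to split the text into sentences yourself before calling model.inference, much as the API module's synthesizer does, and concatenate the resulting audio. The regex splitter below is a naive stand-in for the real segmenter:

import re
import torch

def synthesize_long_text(model, text, language, gpt_cond_latent, speaker_embedding):
    # Naive split on ., ! or ? followed by whitespace; the API module's
    # Synthesizer uses a proper segmenter, this is just an illustration.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks = []
    for sentence in sentences:
        out = model.inference(
            sentence,
            language,
            gpt_cond_latent=gpt_cond_latent,
            speaker_embedding=speaker_embedding,
            temperature=0.7,
        )
        chunks.append(torch.tensor(out["wav"]))
    # Join the per-sentence audio into one waveform of shape [1, samples].
    return torch.cat(chunks).unsqueeze(0)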

Sascha353 commented 12 months ago

> Apparently, we are not splitting the text into sentences when using the code that loads and uses the model directly, so we are sending the whole text at once. The API module, with its synthesizer module, splits the text into sentences and so doesn't have this issue with 30-second audio? Not sure how it works exactly, so some parts of what I'm saying could be wrong, so forgive me for that :D If that's the issue, maybe there is a way to use the model module the same way the API module does, or maybe there is a way to use DeepSpeed with the API module.

I created a feature request for a more "stream-like" experience by splitting the text into chunks: https://github.com/oobabooga/text-generation-webui/issues/4706. This could be a workaround for the mentioned issue (similar to silero here: https://github.com/oobabooga/text-generation-webui/issues/3653) and in addition makes the response time much faster.

erew123 commented 12 months ago

@Sascha353 I did kind of test this a little bit with some of the code I've been working on, but I've not implemented anything yet. You are right though: you can generate about 10 seconds' worth of speech and, by the time that's played, have generated another 10-15 seconds of speech (with the model loaded in VRAM and without DeepSpeed enabled, so it could be even faster).
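
As a rough illustration of that pipelined idea (purely a sketch; synthesize and play here are placeholder callables, not functions from the extension), the next chunk can be generated while the previous one is playing by using a small producer/consumer queue:

import queue
import threading

def stream_tts(sentences, synthesize, play):
    # synthesize(sentence) -> audio chunk; play(chunk) blocks until playback ends.
    audio_q = queue.Queue(maxsize=2)

    def producer():
        for sentence in sentences:
            audio_q.put(synthesize(sentence))  # generation overlaps with playback
        audio_q.put(None)  # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()
    while True:
        chunk = audio_q.get()
        if chunk is None:
            break
        play(chunk)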

FYI, I've been testing DeepSpeed and it's about a 3-4x speed increase on generation, though it's a complicated beast on Windows and a bit easier on Linux. And before anyone decides to just install DeepSpeed: the scripts have to be written to support it.

So, I'm just making a mental note here to look into this properly.

EDIT: Managed to get DeepSpeed working in the current script, 28th Nov 2023.

daswer123 commented 12 months ago

@erew123 Look, I'm not very good at torch, but what do you think about storing the model in RAM and moving it to VRAM, and then back to RAM after the conversion?

I implemented this in xtts-api-server and the difference is not very big: about a 1-second difference between constantly moving it and keeping it in VRAM all the time. https://github.com/daswer123/xtts-api-server/commit/90861c5764ecc08579b536d7d77a2551e287f702

https://github.com/oobabooga/text-generation-webui/assets/22278673/dad4eed4-4d10-42e7-bf88-a6a6aea8d7a8
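
To put a number on that "about 1 second" claim on your own hardware, a small timing helper along these lines (a sketch, not code from xtts-api-server) can measure the cost of each move:

import time

import torch

def timed_move(model, device):
    # Measure how long moving the TTS model between system RAM ("cpu") and
    # VRAM ("cuda") takes, i.e. the per-generation cost of this approach.
    start = time.time()
    model.to(device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the transfer to actually finish
    if device == "cpu":
        torch.cuda.empty_cache()  # hand the freed VRAM back to the driver
    return time.time() - start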

erew123 commented 12 months ago

@daswer123 I love the sound of that, and thanks for the demo. Let me run up SillyTavern, get your XTTS server running, and I'll run some tests to see if it is doing the bit where it kicks a few of the LLM model's layers out of VRAM, as that's the key to the whole thing!

If it does work that way, then you've just found a great solution! :) Though I may have to ask you for a little help implementing it (if you are able to; if not, I'm sure I'll figure it out... eventually). The code I currently have is a bit of a jump beyond what I posted up on here, so there's no point working on that one (I will update that soon).

So let me check it works, and if it does, I'll confirm back to you.

daswer123 commented 12 months ago

Of course. I have already updated it, so you can use it; it's enough to add the --lowvram flag.

erew123 commented 12 months ago

@daswer123 That works beautifully with SillyTavern and text-generation-webui. It handles the VRAM situation perfectly, and, as you say, moving the model to RAM is super fast! One thing I did notice: when it splits into sentences, you can get a situation where a lone ' ends up being a sentence to generate, which causes it to create a strange sound at the end of a message.

(screenshot)

What I did in my code was the below, to swap:

" to ' so that "this is speech" becomes 'this is speech'

.' to '. This ensures that a sentence like 'This is now a sentence that has a full stop and quote after it.' (which seems to cause the "'This is now a sentence that has a full stop and quote after it.", "'" issue) becomes "'This is now a sentence that has a full stop and quote BEFORE it'." The quotation mark is moved in front of the full stop, so it doesn't try to generate "'" as TTS.

* to nothing, as this also seemed to sometimes cause strange audio issues.

string = string.replace('"', "'").replace(".'", "'.").replace('*', '')

I'm sure there is a better method for that! And I don't know how that interacts with the sentence splitting, but it seemed to clear up the strange sounds issue!

As for your lowvram method, the one thing I'm wondering is: will it show the same behaviour when it's running in the same Python process that the LLM models are loaded into? Loading it the normal/original way, the problem was that the LLM loader wouldn't release any memory at all, I assumed because it was in the same Python process, hence I started loading it in another Python process ("lowvram.py").

@daswer123 I don't suppose you know whether the code you wrote will or won't force other things running in the same Python process (the LLM) out of VRAM? Or should I just try getting the code working and see what happens?

Thanks for your help+time!

daswer123 commented 12 months ago

About the sentences: I took a piece of your code and, in a recent update, started using regular expressions; I didn't seem to notice such a problem. (example)

I think there will be no problem with other processes that use VRAM, since we are only working with self.model.
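
For anyone wanting similar cleanup in this extension, a regex-based version of the replacements described earlier in the thread might look like the following (a sketch only, not daswer123's actual code):

import re

def clean_for_tts(text):
    # Drop asterisks and turn double quotes into singles, as described above.
    text = text.replace('*', '').replace('"', "'")
    # Move a quote that directly follows a full stop to before it (".'" -> "'.").
    text = re.sub(r"\.\s*'", "'.", text)
    # Collapse whitespace/newlines so no empty "sentences" reach the TTS engine.
    return re.sub(r'\s+', ' ', text).strip()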

Wuzzooy commented 12 months ago

Yeah, streaming in chunks would be ideal. Watch the RealtimeTTS lib to see how fast it is when it's streamed: https://github.com/KoljaB/RealtimeTTS

But is it possible to do that with just the script from the extension?

erew123 commented 12 months ago

@daswer123 I've updated my script up above and it's now working in all areas, though I have disabled the Low VRAM option for now, but left all the variables, interface bits etc. for Low VRAM. I was just starting to take a look at the TTS wrapper; I sort of get the idea, but it's getting late and I'm tired. You're welcome to update my script if it's simple or you just fancy the challenge.

Also, you may like to take a look at the DeepSpeed implementation in my script (if you don't have it in yours)! (Image down below of the performance gains.) (If you want to get DeepSpeed working on your Windows machine, follow this link https://github.com/oobabooga/text-generation-webui/issues/4734 and the link inside it.) FYI, it's only Python 3.9.18 or lower on Windows, but it's very easy to set up on Linux with any version of Python there, though on both OSes you need to set the CUDA_HOME environment path in your Python environment after installing the Nvidia CUDA Toolkit. Linux is as simple as pip install deepspeed and installing libaio-dev. You can run ds_report after pip installing DeepSpeed and it tells you what file is missing (as long as you have your CUDA_HOME set correctly).

@Wuzzooy As I mentioned above, I've updated both script.py and modeldownload.py. The Low VRAM option is disabled, but you can now pick between the 3 types of models (it will reload on the fly) AND....... drum roll... DeepSpeed is integrated now (you will see the checkbox if it can detect DeepSpeed on your machine). If you click the DeepSpeed checkbox, it will unload your model and load in the locally stored XTTSv2 model (I believe that's the only one you can use with DeepSpeed). I tested it on my Linux system and it's maybe 2-3 times faster with DeepSpeed enabled. (Just take a backup of your current files in case I uploaded something wrong in my code above, it being late at night, so at least you will have a fallback if needed.)

(screenshot)

(screenshot)

Wuzzooy commented 12 months ago

@erew123 Thank you very much for your work, I will try it.

daswer123 commented 12 months ago

> Yeah, streaming in chunks would be ideal. Watch the RealtimeTTS lib to see how fast it is when it's streamed: https://github.com/KoljaB/RealtimeTTS

Guys, this is just crazy fast; I was able to attach this to SillyTavern and it's awesome!

https://github.com/oobabooga/text-generation-webui/assets/22278673/e0f6cfdb-339f-494f-a410-3b5ab1bb1a84

If you're interested you can try it; there are some limitations, but they are not serious:

https://github.com/daswer123/xtts-api-server/pull/10

erew123 commented 12 months ago

@daswer123 That looks good! Does it actually use extra VRAM, as they mention here? https://github.com/KoljaB/RealtimeTTS#coquiengine

If so, I guess it's not going to be compatible with the lowvram solution, but it probably should be an option for those who can afford the memory.

erew123 commented 12 months ago

@daswer123 @Wuzzooy I've completed everything and the new script is up above: https://github.com/oobabooga/text-generation-webui/issues/4712#issuecomment-1825593734

@daswer123 It took me a while to dig through your code to figure out what it was doing with the CUDA moves etc. and how to integrate that into my script, but I've got it working now. Thanks for the help with that and for suggesting it. It works great!

daswer123 commented 12 months ago

As for streaming, there you need to keep the model in memory all the time, and the memory usage is higher.

I'm glad that my code and suggestions helped you to improve your code, as some of your code helped me to improve my project :)

Great job

oobabooga commented 12 months ago

@erew123 could you submit a PR to the repository with your changes? They look good and I would be happy to test them.

erew123 commented 12 months ago

@oobabooga I'll try to figure it out. I'm not too GitHub savvy! Aka, I've never done a PR in my life or even forked a repository.

erew123 commented 11 months ago

Closing this off, as I have fully implemented everything in the current PR that is waiting to be approved.

erew123 commented 11 months ago

@Wuzzooy If you're interested, this has been fully written into its own extension, with lots of other interesting bits built in: https://github.com/erew123/alltalk_tts

erew123 commented 11 months ago

@Wuzzooy I've changed the download link to here: https://filebin.net/t97nd69ac7qm2rsf (the voice files)

Wuzzooy commented 11 months ago

> @Wuzzooy I've changed the download link to here: https://filebin.net/t97nd69ac7qm2rsf (the voice files)

Can't thank you enough for your massive work; it's impressive to me how quickly you've evolved this.

erew123 commented 11 months ago

@Wuzzooy You're welcome! I hope it's all working well for you! Did you check the bottom of the settings page? There's a thanks/nod to you in there :) I hope that was OK?

I might also bump a few more speech engine options in there sometime too... I'm still mulling on that one. Something like: on the settings page, you can choose another TTS engine, it will download the model for it, and swap you over to the other engine. (I don't want to just download 3 different models/engines on start-up. I guess that will be v2 if I do it.)

Wuzzooy commented 11 months ago

Yes, it works well. It is really a complete extension, and no, I didn't even see the shout-out; thank you for that, and yeah, it's okay.

erew123 commented 11 months ago

@Wuzzooy Glad it's working well... though you need to update again! I've just dropped an update; it improves the voice reproduction and it shouldn't mis-speak words now!