kanttouchthis / text_generation_webui_xtts

XTTSv2 Extension for oobabooga text-generation-webui

Low performance issue/question #5

Open jepjoo opened 10 months ago

jepjoo commented 10 months ago

I'm seeing an example of 29s of audio rendered in ~3s, so about a 10:1 ratio on a 4090 here:

https://github.com/RandomInternetPreson/text_generation_webui_xtt_Alts/tree/main#example

But on my 4090 (+Ryzen 7600X) Win 11 system I'm seeing more like a 3:1 ratio.

Any ideas what's bottlenecking me? And anyone else seeing worse than expected performance?

kanttouchthis commented 10 months ago

Someone mentioned that the generation speed depended on the sampling rate and number of channels in the reference audio. Try resampling your audio to 24000 Hz mono and see if that changes anything. The model samples at 24 kHz mono anyway, so there shouldn't be any difference in quality.
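
For example, one way to do that resample with torchaudio (a sketch; the file names are just placeholders):

import torchaudio

wav, sr = torchaudio.load("reference.wav")            # any sample rate / channel count
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 24000)  # resample to 24 kHz
torchaudio.save("reference_24k_mono.wav", wav, 24000)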

jepjoo commented 10 months ago

Should have mentioned, I mostly tested with the included example.wav, which seems to be 22 kHz mono. Poor performance with that too.

dillonroach commented 10 months ago

@kanttouchthis you're asking the tts to do a lot of extra stuff it doesn't need to every time by making the call via tts.tts_to_file()

Here's a short-hand reference implementation of what I have locally, which usually takes 1-2 sec per 20 sec of audio on a 3090:

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import numpy
import nltk  # sentence tokenizer; needs the 'punkt' data (nltk.download('punkt'))

config = XttsConfig()

config.load_json("<path to xtts2 config>")
config.temperature = 0.65
config.decoder_sampler = 'dpm++2m'
config.cond_free_k = 7
config.decoder_iterations = 256
config.num_gpt_outputs = 512

model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="<path-to-model-folder>", use_deepspeed=True) #deepspeed isn't required
model.cuda()
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path="<path-to-ref-wav>")
#make the latent and embeds lists and load in multiple times for different characters
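# (illustrative addition, not from the original snippet) one way to hold latents/embeddings
# for multiple characters as the comment above suggests; names and paths are placeholders:
characters = {"alice": "<path-to-alice-wav>", "bob": "<path-to-bob-wav>"}
gpt_cond_latent, speaker_embedding = {}, {}
for name, wav in characters.items():
    gpt_cond_latent[name], speaker_embedding[name] = model.get_conditioning_latents(audio_path=wav)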

Run all that stuff outside of the usual loop in chat - that's init level stuff and should just be run once and stashed. Then on actual call:

def run_voice(chat_interface_text, character):
    global voice  # my local object that plays audio - but you can keep the save/load to file
    out = []
    # split the reply into sentences and strip the end-of-sequence token
    sentences = nltk.sent_tokenize(chat_interface_text.replace("</s>", ""))
    silence = numpy.zeros(int(0.35 * 24000))  # 0.35 s of silence at 24 kHz between sentences
    for sentence in sentences:
        out.append(model.inference(
            sentence,
            "en",
            gpt_cond_latent[character], #example passing in different latents and embeds for different chars
            speaker_embedding[character],
            temperature=0.7, # Add custom parameters here
            )
          )
    # join the per-sentence clips, padding each with the short silence
    stitched = numpy.concatenate([numpy.append(i['wav'], silence) for i in out])
    # convert float32 samples in [-1, 1] to 16-bit PCM
    voice.object = numpy.int16(numpy.array(stitched, dtype=numpy.float32) * 32767)
    # to write the numpy obj to disk instead (needs torch/torchaudio imported), run:
    # torchaudio.save('file.wav', torch.tensor(voice.object).unsqueeze(0), 24000)
    return
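
A hypothetical call, assuming the latents/embeddings were stored in per-character dicts as sketched above (the text and character name are placeholders):

run_voice("Hello there. It is nice to finally hear your voice.", "alice")
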
RandomInternetPreson commented 10 months ago

Two things that might speed up your inferencing and voice outputs:

  1. If using Windows, enable Hardware-accelerated GPU scheduling. This is a setting in the Windows "Graphics settings"; just turn it on and restart your computer.
  2. Let your computer boot all the way and log in, then restart your computer once again, go into your system BIOS, and enable "resize bar" (Resizable BAR); this also helps reduce latency.
jepjoo commented 10 months ago

Two things that might speed up your inferencing and voice outputs:

1. If using Windows, enable Hardware-accelerated GPU scheduling. This is a setting in the Windows "Graphics settings"; just turn it on and restart your computer.

2. Let your computer boot all the way and log in, then restart your computer once again, go into your system BIOS, and enable "resize bar" (Resizable BAR); this also helps reduce latency.

Thanks for the tips!

Turning off HW accelerated GPU scheduling lowered "Real-time factor" from about 0.37-0.4 to 0.28-0.3. That's a pretty decent boost. I could also observe an increase in GPU usage during audio rendering.
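
(For context: the "Real-time factor" being reported is, as I understand it, generation time divided by audio duration, so an RTF of 0.3 means roughly 9 s to render 30 s of audio; lower is better.)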

erew123 commented 10 months ago

I have another observation, though I'm going to open another ticket about it as a feature request. I'll try to keep my explanation simple here.

I have a 12GB card and loading a 13B model in that card uses 11.7GB of the VRAM, so only 300MB VRAM is left.

My AI text generation is nice and fast at 20 tokens a second. However, when it goes to process the audio, it's clearly swapping the TTS model into the graphics VRAM, perhaps in chunks, as you don't see any major memory changes. So processing, say, 4 lines of text with this setup can take 60 seconds.

If I load a 7B model, which only takes about 8.5GB of my VRAM, I now have 3.5GB of VRAM free and the TTS model can easily load into VRAM without issue, in one nice lump. Generating the audio output now drops to between 9 and 20 seconds! Which is fantastic... though I'm now using a less powerful model!

I tried editing script.py and changing the references to "cuda" to "cpu", which loads the TTS into your system RAM and processes it on your CPU, not your GPU. In my case I have an 8-core/16-thread CPU.

Is CPU rendering faster when I'm using a 13B model and short on VRAM? Yes, just about, I think... processing on my CPU in that situation may be a bit faster, perhaps 10-15%. Obviously it's NOT faster than processing when I am using a 7B model and my GPU with 3.5GB of VRAM spare.

I'm thinking you need about 1.5GB of VRAM to fit the TTS in, maybe closer to 2GB to do it comfortably.

So, it may be faster to use your CPU in some instances, depending on how much VRAM you have left after you have loaded your model and depending on how fast your CPU is.

At this point in time, you can try it on your own system and experiment, but I guess what I'm saying is: "If you don't have much VRAM left on your card after loading your model, expect slower processing time for audio".

If you edit the text-generation-webui\extensions\text_generation_webui_xtts\script.py file to change the 3x "cuda" to "cpu" in there, you DO have to reload Text-Gen-WebUI (unloading and reloading on the session tab may work, not tried it).
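
For illustration only (this is not the actual contents of script.py), the kind of change being described amounts to swapping which device the model is moved to:

# illustrative sketch, not the extension's script.py
device = "cpu"  # was "cuda"; runs the XTTS model on the CPU and leaves the VRAM for the LLM
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="<path-to-model-folder>")
model.to(device)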

Wuzzooy commented 10 months ago

Me neither, I don't get the performance reported by RandomInternetPreson. But when I use the RealtimeTTS version installed in the same environment, which uses the Coqui engine, I can generate a few sentences in one second. The RealtimeTTS version doesn't record any file, though, and isn't an extension integrated into ooba. It proves that my setup can do it, so I don't really understand what is happening. I only have a 4070 Ti, but I did these tests with a 7B Q4 model and have 3.5 GB of VRAM left when the XTTS model + the LLM are loaded.

[screenshot: rltts]

It takes me 20 sec to generate 30 sec of audio on ooba with XTTS.

kanttouchthis commented 10 months ago

(quoting dillonroach's reply and code above)

I can't replicate your results. For my sample text, the extension took 11.7 seconds. Your code took 10.9, but that is with the custom generation parameters. Without those, it also took 11.7 seconds. Your speedup likely comes from the fact that you're using deepspeed.

RandomInternetPreson commented 10 months ago

I too have reworked your code to accommodate the suggestion with no speed increase. However! I think I know the reason.

https://tts.readthedocs.io/en/latest/models/xtts.html

Check out this link; you need deepspeed enabled. I haven't done this yet, but I will today. I think I need to enable deepspeed in oobabooga to get it working. Look at ooba's repo front page; they have instructions on how to enable deepspeed.

https://github.com/oobabooga/text-generation-webui#deepspeed
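
For reference, DeepSpeed at inference time is enabled with a flag when loading the checkpoint, the same flag dillonroach's snippet above uses:

model.load_checkpoint(config, checkpoint_dir="<path-to-model-folder>", use_deepspeed=True)
model.cuda()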

kanttouchthis commented 10 months ago

Unfortunately deepspeed isn't officially supported on Windows. Could probably get it running on WSL though.

RandomInternetPreson commented 10 months ago

This is also on my list to try out. I tried doing it last night but was getting errors with a WSL install. The deepspeed documentation says that it should work in WSL. I wasn't launching with --deepspeed though.

RandomInternetPreson commented 10 months ago

Also, according to the deepspeed documentation it does work on Windows, with the caveat that it only works for inferencing. That makes me think it might work on Windows with ooba, since prebuilt Windows wheels are available.

Wuzzooy commented 10 months ago

I'm able to run deepspeed on Windows with Python 3.9; I failed with Python 3.10/3.11. I used this file to install it: https://huggingface.co/Jmica/audiobook_maker/tree/main. pydantic has to be under 2.0 or you will get some errors.

dillonroach commented 10 months ago

@kanttouchthis yep, most of the big speed difference is from deepspeed; the other smaller chunk is likely the re-compute of the latents and embeddings when doing the 'clone' each time, but that's not a huge task. My general experience has been deepspeed can be a hassle to compile/run on a given env - it's certainly fantastic when you do get it working, but for those less used to dev work, it might be a slog if there isn't already just the right form packaged for them.