neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0

Save autoregressive samples for faster generation? #370

Closed · Sitri2002 closed 1 year ago

Sitri2002 commented 1 year ago

Every time you run Tortoise, you have to generate the autoregressive samples again for existing audio. This seems pretty inefficient if you only want to generate one type of voice and use it many times in the long run. Is there a way to "save" the weightings for an input voice sample and then continuously generate new text with that weighting, instead of having to restart every time (which takes a lot of time)?

neonbjb commented 1 year ago

Hey Jack, that's not actually how Tortoise uses the conditioning input. These inputs are fed into the model through a separate transformer and are "compiled" into a single "conditioning latent". There's no autoregression in this process and it is actually quite zippy compared to the rest of the sampling process.

I do actually provide a way to "precompile" these latents: https://github.com/neonbjb/tortoise-tts#generating-conditioning-latents-from-voices

I don't really recommend bothering with this, though. On most GPUs it'll shave a hundred milliseconds or so. The reason I added this feature was to give researchers a way to tinker with the latents, which can be used to get some interesting behavior.
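As a concrete illustration, here is a minimal sketch of precompiling and reusing latents with the Python API described in the README (load_voice, get_conditioning_latents, tts_with_preset); exact signatures may vary between versions:

```python
# Sketch: compile a voice's conditioning latents once, save them, and reuse
# them for later generations. Based on the API shown in the README; exact
# signatures may differ across versions.
import torch
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# One-time step: load the reference clips and compile them into latents.
voice_samples, _ = load_voice('tom')
conditioning_latents = tts.get_conditioning_latents(voice_samples)
torch.save(conditioning_latents, 'tom.pth')

# Later runs: skip the clips and feed the saved latents directly.
latents = torch.load('tom.pth')
gen = tts.tts_with_preset('Hello world.',
                          conditioning_latents=latents,
                          preset='fast')
```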

bluusun commented 1 year ago

I tried this:

```
$ python3 get_conditioning_latents.py --voice halle_berry
Traceback (most recent call last):
  File "/root/tortoise-tts/tortoise/get_conditioning_latents.py", line 23, in <module>
    cond_paths = voices[voice]
KeyError: 'halle_berry'
```

```
root@C.6077805:~/tortoise-tts/tortoise/voices$ ls -lh
total 0
drwxr-xr-x 2 root root 45 Jan 13 19:51 halle_berry
drwxr-xr-x 2 root root 45 Jan 13 19:51 morgan_freeman
```

cristianmercado19 commented 1 year ago

I believe I got @Sitri2002's point, although I do not have a deep understanding of this tool. I tried to convert the text "My god baby!, you are amazing" into speech with the tom voice and the ultra_fast preset. This took 10 minutes.

My laptop (Windows 11):
- Processor: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
- Installed RAM: 16.0 GB (15.7 GB usable)
- System type: 64-bit operating system, x64-based processor
- Display: Intel Iris Xe Graphics, 8165 MB (no NVIDIA GPU)

The command:

```
(tortoise) PS C:\Repositories\tortoise-tts> python tortoise/do_tts.py --text "My god baby!, you are amazing" --voice tom --preset ultra_fast
```

The first step alone ("Generating autoregressive samples..") took 8 minutes.

Output details:

```
Generating autoregressive samples..
100%|█| 16/16 [08:46<00:00, 32.88s/it]
Computing best candidates using CLVP
100%|█| 16/16 [00:58<00:00,  3.65s/it]
Transforming autoregressive outputs into audio..
100%|█| 30/30 [00:36<00:00,  1.20s/it]
100%|█| 30/30 [00:32<00:00,  1.09s/it]
100%|█| 30/30 [00:40<00:00,  1.34s/it]
```

I believe the key question is: how can we reduce the processing time?

It sounds like the first step ("Generating autoregressive samples..") takes most of the overall time. Maybe there is a pre-processing, compilation, or model-generation step that we need to run before calling do_tts.py, or maybe this is just the way it works. I could not spot it, or any other performance guidance, just from reading the README.
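For what it's worth, a likely explanation on hardware like this: without an NVIDIA GPU, PyTorch falls back to the CPU, which pushes generation times from seconds into minutes. A quick, generic check (plain PyTorch, not specific to this repo):

```python
# If this prints False, Tortoise runs entirely on the CPU, which would
# account for multi-minute generation times on a laptop without a GPU.
import torch
print(torch.cuda.is_available())
```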

64jcl commented 11 months ago

So are the autoregressive samples specific to the provided text? Or are they something that can be pre-computed for a voice and then reused to generate output faster? Like Cristian, I have no idea how any of this works, but I also noticed it spent a long time on "Generating autoregressive samples.." before the final steps. It looked as if it was taking the input wav files for a voice and doing some training over and over before converting the actual text into a wav in the last steps. But since this issue has been closed, I guess that is not the case?

mikeymezher commented 8 months ago

> Hey Jack, that's not actually how Tortoise uses the conditioning input. These inputs are fed into the model through a separate transformer and are "compiled" into a single "conditioning latent". There's no autoregression in this process and it is actually quite zippy compared to the rest of the sampling process.
>
> I do actually provide a way to "precompile" these latents: https://github.com/neonbjb/tortoise-tts#generating-conditioning-latents-from-voices
>
> I don't really recommend bothering with this, though. On most GPUs it'll shave a hundred milliseconds or so. The reason I added this feature was to give researchers a way to tinker with the latents, which can be used to get some interesting behavior.

Old thread, but I wanted to make a note here: storing these latents can save significantly more time than a hundred milliseconds or so. Depending on the number of voice conditioning clips used, the time savings are quite meaningful. On my mobile RTX 4090, using 14 audio clips of ~10 s each, calculating these latents takes almost 5 s. The process scales close to linearly with the number of clips (which makes sense, since the code loops over the voice samples to compute the mel spectrograms).

HOWEVER, I did some further digging to see exactly what was eating the most time. It seems audio.wav_to_univnet_mel(...) takes the majority of it (and it is called inside the loop over voice samples). Within this function, a TacotronSTFT is initialized and a mel spectrogram is computed from the wav clip.

Initialization (and moving to device) of the TacotronSTFT object takes ~250 ms, while computing the mel spectrogram takes only ~1 ms. There doesn't appear to be any reason TacotronSTFT requires re-initialization for each voice sample; initializing it once and keeping the object on the device would save a significant amount of time. I'll likely file a PR for this.
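For illustration, a sketch of the fix described above, assuming TacotronSTFT from tortoise/utils/audio.py and the constructor arguments used by wav_to_univnet_mel; the clips below are placeholders standing in for real voice samples:

```python
# Sketch of the proposed fix: build the TacotronSTFT once and reuse it,
# instead of re-initializing it inside the loop over voice samples.
# Assumes TacotronSTFT and the constructor arguments from
# tortoise/utils/audio.py; the clips below are synthetic placeholders.
import torch
from tortoise.utils.audio import TacotronSTFT

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Placeholder: 14 ten-second clips at 24 kHz (values kept in [-1, 1]).
voice_samples = [torch.rand(1, 24000 * 10) * 2 - 1 for _ in range(14)]

# Before: TacotronSTFT(...) was constructed (and moved to the device) once
# per clip, costing ~250 ms each time.
# After: construct once, then each mel spectrogram takes ~1 ms.
stft = TacotronSTFT(1024, 256, 1024, 100, 24000, 0, 12000).to(device)
mels = [stft.mel_spectrogram(clip.to(device)) for clip in voice_samples]
```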

mikeymezher commented 8 months ago

See PR: https://github.com/neonbjb/tortoise-tts/pull/725

mikeymezher commented 8 months ago

Btw, thank you @neonbjb - this is an awesome library and a well thought out architecture.