neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0

Voice process starts fine and quickly fails with noise and complete loss #711

Open brunoais opened 6 months ago

brunoais commented 6 months ago

Setup (note: I installed pytorch-cuda 11.8 rather than 11.7 because 11.7 did not work for me):

conda create --name tortoise python=3.9 numba inflect
conda activate tortoise
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install transformers=4.29.2
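
To confirm that an environment like the one above actually exposes CUDA to PyTorch, a minimal check along these lines can be run inside the activated env. This is only a sketch: the helper name `describe_torch_env` is made up for illustration, and it reports what PyTorch sees rather than diagnosing the noise problem itself.

```python
import importlib.util


def describe_torch_env():
    """Return a small dict describing the PyTorch/CUDA setup (hypothetical helper)."""
    # Avoid an ImportError so the check is still useful in a broken env.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against; None on CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["device_name"] = props.name
        info["total_vram_gib"] = round(props.total_memory / 2**30, 2)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back `False`, or `built_for_cuda` is `None`, generation would fall back to the CPU, which would be one possible explanation for long run times.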

tortoise-tts Version

https://github.com/neonbjb/tortoise-tts/tree/1e061bc6752f05bccb59748c8bd7c7fc85d54988

Command:

cd /path/to/tortoise-tts/scripts/
python tortoise_tts.py -p "fast" -v deniro -O /path/to/output/ "[poem] Love, me tender... Love me sweet... Never let me go... For this makes, my life complete... Never let me go?"

Redacted logs

Rendering deniro_00 (1 of 1)...
  [poem] Love, me tender... Love me sweet... Never let me go... For this makes, my life complete... Never let me go?
Generating autoregressive samples..
100%|██████████| 96/96 [15:21<00:00,  9.60s/it]
Computing best candidates using CLVP
100%|██████████| 96/96 [00:07<00:00, 13.33it/s]
Transforming autoregressive outputs into audio..
100%|██████████| 80/80 [00:45<00:00,  1.75it/s]
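
The progress bars above already show where the time goes: 96 autoregressive steps at 9.60 s each is roughly 922 s, which matches the 15:21 reported. A small sketch that pulls the elapsed time out of each tqdm line and compares the stages is shown here. The stage labels and the `elapsed_seconds` helper are illustrative names, not part of tortoise-tts.

```python
import re

# Matches the "N/N [H:MM:SS<" or "N/N [MM:SS<" part of a tqdm progress line.
_TQDM = re.compile(r"(\d+)/(\d+) \[(?:(\d+):)?(\d+):(\d+)<")


def elapsed_seconds(line):
    """Return the elapsed time, in seconds, reported by one tqdm line."""
    match = _TQDM.search(line)
    if match is None:
        raise ValueError(f"not a tqdm progress line: {line!r}")
    hours = int(match.group(3) or 0)
    minutes = int(match.group(4))
    seconds = int(match.group(5))
    return hours * 3600 + minutes * 60 + seconds


# Progress lines copied from the log above; the keys are illustrative labels.
STAGES = {
    "autoregressive": "100%|██████████| 96/96 [15:21<00:00,  9.60s/it]",
    "clvp": "100%|██████████| 96/96 [00:07<00:00, 13.33it/s]",
    "audio_decode": "100%|██████████| 80/80 [00:45<00:00,  1.75it/s]",
}

if __name__ == "__main__":
    times = {name: elapsed_seconds(line) for name, line in STAGES.items()}
    total = sum(times.values())
    for name, secs in times.items():
        print(f"{name}: {secs} s ({secs / total:.0%} of total)")
```

On these numbers the autoregressive stage accounts for nearly all of the run time, so that is the stage worth checking first when looking at why generation is slow.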

Samples

deniro_00_00.webm deniro_combined.webm

System:

OS: Linux, Ubuntu 22.04
CPU: Ryzen 4800H
GPU: NVIDIA TU106M [GeForce RTX 2060.2 Mobile]

Notes:

Generation seems to use about 8 GB of RAM, one CPU core at 100%, and 3.5 GB of VRAM (although the GPU has 6 GB). It also seems to take a long time, about 5-10 minutes per generation of that size. That looks like too much, given that the dGPU runs at only about half of its processing capacity while generating the file. It works here, though: https://huggingface.co/spaces/Manmay/tortoise-tts

Tortoise17 commented 4 months ago

I have the same issue. Is there a solution, or any hint for getting rid of these fuzzy ending parts?