huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0
2.6k stars 265 forks

Model stumbling on its words #24

Open samuelbraun04 opened 1 month ago

samuelbraun04 commented 1 month ago

Running the following code:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A male speaker with a low-pitched voice delivering his words at a slow pace in a small, confined space with a very clear audio and an animated tone."
prompt = "In the annals of history, the ink that drafted peace often dried under the shadow of future conflicts. Today, we dive deep into the bottom 10 worst peace treaties ever signed, the naive hopes and the grim repercussions they bore, unraveling a tapestry of unintended consequences that would haunt nations for generations. From agreements that sowed the seeds of resentments leading to catastrophic wars, to those that carved up continents disregarding the people who lived there, we explore how peace can sometimes lead to anything but."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)

Outputs the .wav file posted at this link: http://sndup.net/vzyp

How can I get it to correctly output the prompt text? Is my prompt too large? Am I using the model incorrectly? Thank you!

bkutasi commented 1 month ago

This prompt looks far too long. I would use at most one sentence per prompt.

ai-bits commented 1 month ago

@bkutasi Hey Balázs, any idea how to tweak things to make longer prompts work? I've got 256GB RAM and 2x 20GB VRAM RTX 4000, so it's sort of a waste otherwise. ;-) I thought I finally had a utility to read articles to me. Maybe something like what's in helpers\model_init_scripts\init_dummy_model.py:

# set other default generation config params
model.generation_config.max_length = int(30 * model.audio_encoder.config.frame_rate)

Greetings from Upper Austria! G.
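For reference, the `max_length` line quoted above is just a duration-to-frames conversion: seconds of audio multiplied by the audio codec's frame rate. A minimal sketch of that arithmetic follows. The helper name `max_length_for` is hypothetical, and the frame rate of 86 is only an assumed example value; the real one should be read from `model.audio_encoder.config.frame_rate`. Raising the limit only lifts the cap on generated frames and does not by itself say anything about output quality at longer durations.

```python
def max_length_for(seconds: float, frame_rate: float) -> int:
    """Number of audio-codec frames needed to cover `seconds` of audio."""
    return int(seconds * frame_rate)


# Assumed example value, NOT read from the model. In practice use
# model.audio_encoder.config.frame_rate instead.
ASSUMED_FRAME_RATE = 86

for seconds in (30, 60, 120):
    frames = max_length_for(seconds, ASSUMED_FRAME_RATE)
    print(f"{seconds} s -> max_length of {frames} frames")
```

With a loaded model, the result could be assigned to `model.generation_config.max_length`, mirroring the line from the init script.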

bkutasi commented 1 month ago

@ai-bits Hello Günter, and greetings from Tirol! I think we won't ever be able to generate minutes-long audio in one pass, except maybe if chunking or some kind of sequential generation gets implemented (at the moment the model was trained on 30-second voice samples). So this is kind of a software limitation, and I wish I had time to fork/contribute to this because I'm really interested in good TTS implementations. If you simply want more output, you can build a pipeline that calls the model once per sentence or line, but the voices will differ noticeably between calls, because the model can't reproduce the exact same voice. There was some discussion of this here: https://github.com/huggingface/parler-tts/issues/9 https://github.com/huggingface/parler-tts/issues/11
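The per-sentence pipeline described above could look roughly like the sketch below. It is not part of parler-tts: `split_sentences` and `synthesize_long` are hypothetical helpers, the regex splitter is naive (it will also split after abbreviations such as "e.g."), and the synthesizer is passed in as a plain callable so the chunking logic can be tried without loading a model. The voice-consistency problem between calls remains unsolved here.

```python
import re
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_long(
    text: str,
    synthesize: Callable[[str], Sequence[float]],
    sampling_rate: int,
    pause_seconds: float = 0.25,
) -> List[float]:
    """Call `synthesize` once per sentence and join the audio with short pauses."""
    silence = [0.0] * int(pause_seconds * sampling_rate)
    audio: List[float] = []
    for i, sentence in enumerate(split_sentences(text)):
        if i > 0:
            audio.extend(silence)  # gap between consecutive sentences
        audio.extend(synthesize(sentence))
    return audio


# Hypothetical adapter reusing the calls from the original post
# (model, tokenizer, device and input_ids are defined as in that code):
#
# def parler_synthesize(sentence):
#     prompt_input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(device)
#     generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
#     return generation.cpu().numpy().squeeze().tolist()
#
# audio = synthesize_long(prompt, parler_synthesize, model.config.sampling_rate)
```

Each call then stays within the roughly 30-second range the model was trained on, at the cost of one generation call per sentence.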

platform-kit commented 1 month ago

@bkutasi why not ever? What's the blocker to re-training with longer samples?

ai-bits commented 1 month ago

> wish I had time to fork/contribute

Just too many fronts. At the current pace, I'd give Parler a couple of weeks to mature and use alternatives for TTS in the meantime. For STT, Whisper is pretty good compared with Dragon NaturallySpeaking in the '90s. <grin> Been talking to GPT since July and to VS Code for weeks.

Cheers G. PS: 49" moni, 4 hd Chrome Windoze: this, Code, Bayern-Ars, MCI-RMA FCB just won. MCI overtime w/ projector in bed.

ai-bits commented 1 month ago

@bkutasi I've been following the whole AI craze, but lost track of what's free and local in TTS (and is really open even needed?).

Not yet decided what to think of LocalAI.io, but just got TTS debugged on the M3 Mac. (AMD64 via Docker) Still don't know what special chars sometimes throw it off w/ text from the clipboard.

curl http://localhost:8080/tts -H "Content-Type: application/json" -d "{ \"model\":\"tts-1\", \"input\": \"$(pbpaste)\" }" --output ~/local-ai.wav

Bark,.. available. Hungarian (?) in the "Holy Land" (Tirol)? ;-) Cheers, G.
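One plausible cause of the special-character failures with the curl command above is that the clipboard text is pasted unescaped into a hand-written JSON string, so any double quote, backslash or newline in it produces invalid JSON. A minimal sketch that lets a JSON encoder do the escaping is shown below. The helper names `build_tts_payload` and `request_tts` are hypothetical; the URL and model name are copied from the curl command, and the response is assumed to be raw audio bytes, as `--output` in that command suggests.

```python
import json
import urllib.request


def build_tts_payload(text: str, model: str = "tts-1") -> bytes:
    """Encode the request body; json.dumps escapes quotes, backslashes and newlines."""
    return json.dumps({"model": model, "input": text}).encode("utf-8")


def request_tts(
    text: str,
    url: str = "http://localhost:8080/tts",
    out_path: str = "local-ai.wav",
) -> None:
    """POST the text to the TTS endpoint and save the returned bytes to a file."""
    req = urllib.request.Request(
        url,
        data=build_tts_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


# Text that would break the hand-built JSON in the shell one-liner:
tricky = 'He said "hello"\nthen left\\quit.'
print(build_tts_payload(tricky).decode("utf-8"))
```

On macOS, the clipboard text could be piped into such a script with `pbpaste` and read from standard input.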