[Inference Speed] How can we shorten the inference time per generation?

rsong0606 commented 4 months ago

Hey Team, good work overall!

I am using this sample code and played a bit with different descriptions. Overall, this is great. However, it took around 9 seconds to generate audio for a text of about 20 tokens.

import torch
import time
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")
prompt = "Hey, how are you doing today? How about we go for lunch later today?"
description = "Talia with a young and cute voice delivers her words in an energetic and cheerful tone. The audio quality is high quality, capturing the lively and playful nuances of her speech. She speaks quickly, with occasional pauses to breathe, adding a sense of fun to the conversation. Emotion: happy."

start_time = time.time()

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Talia's voice generation time: {elapsed_time:.2f} seconds")

audio_arr = generation.cpu().numpy().squeeze()

Talia's voice generation time: 8.13 seconds
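
A single timed call like the one above can be misleading on a GPU: the first call often includes one-off setup cost, and CUDA work is queued asynchronously, so the clock may be read before the device has finished. The following is a minimal sketch of a more careful measurement, not part of the original sample code. The helper names `time_call` and `summarize` are hypothetical and introduced here only for illustration.

```python
import statistics
import time


def time_call(fn, warmup=1, runs=3, sync=None):
    """Return a list of wall-clock durations (seconds), one per timed run of fn().

    fn     : zero-argument callable to measure (hypothetical wrapper,
             e.g. around model.generate from the snippet above).
    warmup : number of untimed calls made first, so one-off setup cost
             is kept out of the measurement.
    runs   : number of timed calls.
    sync   : optional zero-argument callable invoked before each clock
             read, e.g. torch.cuda.synchronize on a CUDA device.
    """
    for _ in range(warmup):
        fn()

    durations = []
    for _ in range(runs):
        if sync is not None:
            sync()  # make sure earlier queued work is not counted
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the device before stopping the clock
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

With the variables from the sample code, this could be called as `time_call(lambda: model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids), sync=torch.cuda.synchronize if device.startswith("cuda") else None)`. Reporting the median of a few runs alongside the hardware makes numbers easier to compare across setups.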

bxclib2 commented 4 months ago

I tried this as well, and inference is very slow. Could anyone take a look at this issue?

sang-nguyen-ts commented 3 months ago

Which hardware are you using? Can you please provide more details about your environment setup?
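
For anyone answering this, a small script along these lines could gather the basics in one go. It is only a sketch: the function name `collect_env` is hypothetical, and it assumes PyTorch may or may not be installed, falling back to empty values if the import fails. PyTorch also ships its own, more detailed report via `python -m torch.utils.collect_env`.

```python
import platform
import sys


def collect_env():
    """Collect basic environment details relevant to inference speed."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "torch": None,
        "cuda_available": False,
        "cuda_version": None,
        "gpu": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; return the generic details only.
        return info

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    info["cuda_version"] = torch.version.cuda  # None on CPU-only builds
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Pasting this output together with the measured generation time shows at a glance whether the model actually ran on a GPU or fell back to CPU.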

Guppy16 commented 3 months ago

Check out their guide on speeding up inference: https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md

huggingface/parler-tts, issue #86