huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

Benchmarks of parler-tts, the emergence of TTS! #19

Open BBC-Esq opened 1 month ago

BBC-Esq commented 1 month ago

Hey @sanchit-gandhi, I like the repo. Excited to see this being worked on. Here's a benchmark against WhisperSpeech. I used your sample script on the exact same text snippet, and it finished processing in 16.04 seconds. However, this repo runs in float32 while I believe WhisperSpeech runs in float16. Can you provide me with the modification to run in float16, or even bfloat16? I'm going to do a comparison of this, Bark, and WhisperSpeech:

[benchmark screenshot]

I want to add that this says nothing about the quality, only speed. I'll evaluate quality next after I ensure comparable testing procedures regarding compute time. Here's the script I used:

```
import time

import sounddevice as sd
import torch
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

# Setup device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# Prepare input
prompt = "This script processes a body of text one sentence at a time and plays them consecutively. This enables the audio playback to begin sooner instead of waiting for the entire body of text to be processed. The script uses the threading and queue modules that are part of the standard Python library. It also uses the sound device library, which is fairly reliable across different platforms. I hope you enjoy, and feel free to modify or distribute at your pleasure."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# Start timer
start_time = time.time()

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# End timer
end_time = time.time()
processing_time = end_time - start_time

# Print processing time in green
print(f"\033[92mProcessing time: {processing_time:.2f} seconds\033[0m")

sampling_rate = model.config.sampling_rate
sd.play(audio_arr, samplerate=sampling_rate)
sd.wait()
```
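
One caveat on the timing: `time.time()` only brackets host-side calls, so CUDA kernels that are still queued can make the number misleading. In this particular script the `.cpu()` copy forces a synchronization, so the figures should be close, but an explicit sync is the safer pattern. A minimal sketch, reusing the variables from the script above:

```
import time

import torch

# Wait for all queued GPU work to finish before reading the clock,
# so the measurement covers exactly the generate() call.
torch.cuda.synchronize()
start_time = time.time()
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
torch.cuda.synchronize()
print(f"Processing time: {time.time() - start_time:.2f} seconds")
```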

Lastly, let me know what other speedups I can use, such as BetterTransformer, which I think is part of torch now unless I'm mistaken. I can't test FlashAttention-2 unless you help me install it; I've tried.

BBC-Esq commented 1 month ago

Hey @sanchit-gandhi, here are updated comparisons. Feel free to let me know how to cast to float16/bfloat16, and/or how to use bitsandbytes or whatever else this type of model is compatible with. Congratulations to HF; pretty impressive for a ".1" version model. Looking forward to version 1.0!

[benchmark screenshot]

[benchmark screenshot]

80boys commented 1 month ago

@BBC-Esq What model is s2a-q4?

BBC-Esq commented 1 month ago

The WhisperSpeech library uses two types of models, t2s (text-to-semantic) and s2a (semantic-to-acoustic), and there are multiple checkpoints of each, so this benchmark tests every combination.
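
For context, WhisperSpeech's `Pipeline` lets you pick the two stages independently, which is where the combinations come from. A minimal sketch based on the WhisperSpeech README (the exact model references here are illustrative):

```
from whisperspeech.pipeline import Pipeline

# t2s (text-to-semantic) and s2a (semantic-to-acoustic) checkpoints are
# selected independently, so a benchmark can sweep every pairing.
pipe = Pipeline(
    t2s_ref="collabora/whisperspeech:t2s-small-en+pl.model",
    s2a_ref="collabora/whisperspeech:s2a-q4-tiny-en+pl.model",
)
pipe.generate_to_file("output.wav", "Hello from WhisperSpeech!")
```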

ylacombe commented 1 month ago

Hey @BBC-Esq, this is a really interesting benchmark, thanks for sharing! To run Parler-TTS Mini in fp16, you only need to add `torch_dtype=torch.float16` to the `from_pretrained` call:

```
import torch
from parler_tts import ParlerTTSForConditionalGeneration

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1", torch_dtype=torch.float16
)
```
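
End to end, an fp16 run of the earlier benchmark script would look something like this (a minimal sketch; fp16 weights generally assume a CUDA device, and casting the output back to float32 is a precaution for audio libraries that don't handle half precision):

```
import torch
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0"  # fp16 inference generally wants a GPU

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1", torch_dtype=torch.float16
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A female speaker with a slightly low-pitched voice."
prompt = "Hello, this is a test of float16 inference."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
# Cast back to float32 before converting, since some playback/IO
# libraries don't handle float16 NumPy arrays well.
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()
```

Swapping `torch.float16` for `torch.bfloat16` works the same way on GPUs that support it. On the BetterTransformer question above: recent transformers versions can usually request PyTorch's scaled-dot-product attention with `attn_implementation="sdpa"` (or `"flash_attention_2"` if that package is installed) in the same `from_pretrained` call, assuming Parler-TTS forwards that keyword.
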
BBC-Esq commented 4 weeks ago

Awesome, thanks dude! Don't know why I didn't realize that, lol. Anyways, here's the updated bench: about the same processing time, but about 30% less VRAM used. At a certain point it basically comes down to preference and/or quality. For example, Bark has preset voices you hardcode, WhisperSpeech can use a 30-second audio clip at runtime to clone a voice, and Parler can create a voice based on a description. I haven't tested Parler's approach yet...but just want to be clear that my testing doesn't address quality/preference, which can be highly subjective.

For example, Parler uses the same VRAM as Bark small in float32. It might still be worth it to someone, however, because of the voice...just to give one example.

Also, notice that a few numbers changed, which I'm guessing is due to my starting to use the relatively new @torch.inference_mode() decorator.
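
For anyone reproducing the numbers, the decorator usage is standard PyTorch; a minimal sketch wrapping the timed call from the earlier script:

```
import torch

@torch.inference_mode()  # disables autograd tracking; lighter than plain no_grad
def generate_audio(model, input_ids, prompt_input_ids):
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    return generation.cpu().numpy().squeeze()
```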

[benchmark screenshot]

[benchmark screenshot]

ylacombe commented 3 weeks ago

This is great! Thanks so much for sharing!