huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

[show and tell] apple mps support #6

Open bghira opened 1 month ago

bghira commented 1 month ago

with newer pytorch (2.4 nightly) we get bfloat16 support in MPS.
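
a quick way to confirm that the backend and the dtype are actually usable on a given install is the snippet below (a minimal sketch, independent of parler-tts; older builds raise on the bfloat16 op):

import torch

# the MPS backend has to be available and compiled into this torch build
assert torch.backends.mps.is_available() and torch.backends.mps.is_built()

# bfloat16 ops on MPS only work on recent builds; older ones raise here
x = torch.ones(4, 4, dtype=torch.bfloat16, device="mps")
print((x @ x).dtype)  # torch.bfloat16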

i tested this:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "mps:0"

# load the checkpoint straight onto the MPS device in bfloat16
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device=device, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "welcome to huggingface"
description = "An old man."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device=device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device=device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
# numpy has no bfloat16, so cast to float32 before converting
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

sanchit-gandhi commented 1 month ago

That's awesome, thanks for sharing @bghira! How fast was inference on your local machine?

bghira commented 1 month ago

it gets slower as the sample size increases, but this test script takes about 10 seconds to run on an M3 Max.

maxtheman commented 1 month ago

I got this working as well! Inference time seems to increase more than linearly with prompt size.

I think the reason is that inference itself takes a surprising amount of memory: loading the model takes the expected ~3GB of memory, but then inference takes 15GB on top of that, which is probably what's slowing it down on my machine (16GB M2).
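
One way to check where the memory goes is to read the MPS allocator counters around generation (a rough sketch; it reuses model, input_ids and prompt_input_ids from the script above, and mps_mem_gb is just a made-up helper name):

import torch

def mps_mem_gb():
    # tensor memory currently held vs. total memory the Metal driver has allocated
    return (torch.mps.current_allocated_memory() / 1e9,
            torch.mps.driver_allocated_memory() / 1e9)

print("after model load:", mps_mem_gb())
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
torch.mps.synchronize()  # let queued kernels finish before reading the counters
print("after generate:", mps_mem_gb())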

QueryType commented 1 month ago

> I got this working as well! Inference time seems to increase more than linearly with prompt size.
>
>   • 3 seconds of audio: 10 seconds of generation
>   • 8s of audio: ~90 seconds of generation
>   • 10s of audio: ~3min of generation
>
> I think the reason is that inference itself takes a surprising amount of memory: loading the model takes the expected ~3GB of memory, but then inference takes 15GB on top of that, which is probably what's slowing it down on my machine (16GB M2).

Was swapping activated? I will try on a Mac Mini M2 (24GB). Do we know the performance with CUDA on a similar machine?

bghira commented 1 month ago

on the 128GB M3 Max i can get pretty far into the output window before the time increases to 3 minutes.

it'll take about a minute for 30 seconds of audio.

QueryType commented 1 month ago

I am getting 2s of audio in 11 seconds, and 6s of audio in 36 seconds.

janewu77 commented 1 month ago

my data, on a 64GB M2 Max:

seconds of audio | cpu (seconds of generation) | mps (seconds of generation)
--- | --- | ---
1 | 7 | 10
3 | 13 | 17
7 | 30 | 44
9 | 41 | 194
18 | 71 | 308
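
For anyone who wants to reproduce these figures, a small timing wrapper around generate is enough (a sketch; it reuses the model and tokenizer from the first script, and the example prompts are arbitrary):

import time
import torch

def bench(prompt, description="An old man."):
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to("mps:0")
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps:0")

    start = time.perf_counter()
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    torch.mps.synchronize()  # drain the MPS queue before stopping the clock
    elapsed = time.perf_counter() - start

    audio_seconds = generation.shape[-1] / model.config.sampling_rate
    print(f"{audio_seconds:5.1f}s of audio in {elapsed:5.1f}s of generation")

bench("welcome to huggingface")
bench("welcome to huggingface, where we build open source tools for machine learning")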