huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GenerationMixin sample() runs forever #31484

Open BirgitPohl opened 2 weeks ago

BirgitPohl commented 2 weeks ago

System Info

Python 3.12.2; transformers version 4.37.2, but also 4.41.2 (I switched between the versions to see the difference). I'm using the CPU of a MacBook Pro with an M2 chip and 24 GB of memory.

I used BarkModel to generate text-to-speech output and noticed it runs forever. I'm talking about more than 30 minutes for a text such as 'sample text'. While debugging, I found that it loops through a while True loop in the sample() method.

import nltk
import torch
import warnings
import numpy as np
from transformers import AutoProcessor, BarkModel
from rich.console import Console
import sounddevice as sd

console = Console()

warnings.filterwarnings(
    "ignore",
    message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
)

class TextToSpeechService:
    def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        """
        Initializes the TextToSpeechService class.

        Args:
            device (str, optional): The device to be used for the model, either "cuda" if a GPU is available or "cpu".
            Defaults to "cuda" if available, otherwise "cpu".
        """
        self.device = device
        self.processor = AutoProcessor.from_pretrained("suno/bark-small")
        self.model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16)  # loads the weights in half precision
        self.model.to(self.device)

    def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            console.print(f"generate audio")
            audio_array = self.model.generate(**inputs, pad_token_id=10000) # TODO it hangs here
            console.print(f"completed")

        audio_array = audio_array.cpu().numpy().squeeze()
        sample_rate = self.model.generation_config.sample_rate
        return sample_rate, audio_array

    def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given long-form text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1". List of speakers: https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        pieces = []
        sentences = nltk.sent_tokenize(text)  # requires the NLTK "punkt" tokenizer data: nltk.download("punkt")
        silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))

        for sent in sentences:
            console.print(f"A sent is: {sent}")
            sample_rate, audio_array = self.synthesize(sent, voice_preset)
            pieces += [audio_array, silence.copy()]

        return self.model.generation_config.sample_rate, np.concatenate(pieces)

if __name__ == "__main__":
    tts = TextToSpeechService()
    sample_rate, audio_array = tts.long_form_synthesize('hi')
    sd.play(audio_array, sample_rate)

When I first tried it out, I didn't have a problem. But then I refactored something in main, without touching anything in the TTS class or the generate() method, and the issue appeared. I later reverted what I did in main to see what was happening, and it would still run for over 30 minutes for one little audio output.

I'd like some guidance on how I can make the loop break quickly, meaning within a couple of seconds, at least on a good Mac CPU.
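
For illustration, this is the kind of thing I'm imagining: a minimal sketch that caps the number of generated tokens so the while True loop has to exit. The prefixed parameter semantic_max_new_tokens is my assumption of how to reach Bark's first (semantic) sub-model; I haven't confirmed that BarkModel routes it through.

import torch
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")

inputs = processor("sample text", voice_preset="v2/en_speaker_1", return_tensors="pt")

with torch.no_grad():
    # Assumed: BarkModel.generate() forwards "semantic_"-prefixed kwargs to the
    # semantic sub-model, so this would bound its sampling loop.
    audio_array = model.generate(**inputs, semantic_max_new_tokens=256)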

Who can help?

Tagging @gante for generation and @stevhliu for documentation. I checked the documentation and could find guidelines on optimization, but none that explains how to make the while True loop break quickly.

Information

Tasks

Reproduction

  1. Install the libraries on a device with a CPU.
  2. Copy the code above, which is hopefully still close enough to the official documentation.
  3. Run it and see how long it takes for you (a timing sketch follows below).
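
For step 3, a minimal timing wrapper using only the standard library (assuming the TextToSpeechService class from above is in scope):

import time

tts = TextToSpeechService()
start = time.perf_counter()
sample_rate, audio_array = tts.long_form_synthesize('hi')
print(f"synthesis took {time.perf_counter() - start:.1f}s")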

Expected behavior

Execution of sample() in GenerationMixin shouldn't take over 30 minutes on a CPU before I have to give up. It should rather take a couple of seconds, even if it is expected to be slow on CPU.

zucchini-nlp commented 2 weeks ago

@BirgitPohl hey! I tried to run the provided script: it runs in a few seconds on a GPU but also appears stuck on CPU. I think it's not stuck forever, but simply taking more time on CPU, since generating audio requires more tokens than text, which is expected :)

BirgitPohl commented 2 weeks ago

> @BirgitPohl hey! I tried to run the provided script: it runs in a few seconds on a GPU but also appears stuck on CPU. I think it's not stuck forever, but simply taking more time on CPU, since generating audio requires more tokens than text, which is expected :)

That is good, it worked for you. :)

How much time did it take on the CPU until you got an output? How much time would you allow for a CPU?

Also, did you consider that I mentioned it wasn't a problem on my first attempts, but then it became one? Back then I got a result after a couple of seconds. That is what I expect with a CPU. Give it 5 or 10 seconds; I'm fine with that.

Again, I did not touch the generate() method at all when I refactored things, and since I've had this issue, I minimized the code to what's in the opening post. The generate() method now gets an even shorter string. I do see some computation happening, since I spread some console outputs through the sample() method and watched the input_ids variable grow endlessly. But even after 30 minutes I still didn't hear any audio. Would the "small" Bark model really take that much time?

And I have absolutely no clue why that is, or how I can influence it so that I get the same experience as on my first attempts. I wonder if there is something I can do with input_ids via the generate() method.
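
For instance, something along these lines is what I'm picturing: a sketch of a custom stopping criterion that logs how input_ids grows and enforces a hard cap. StoppingCriteria and the stopping_criteria argument are standard GenerationMixin machinery; whether BarkModel forwards them into its sub-models' sample() loops is an assumption on my part.

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class LengthLoggerCriteria(StoppingCriteria):
    """Logs the input_ids length each step and stops at a hard cap."""

    def __init__(self, cap: int = 512):
        self.cap = cap

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
        print(f"input_ids length: {input_ids.shape[-1]}")
        done = input_ids.shape[-1] >= self.cap
        # Recent transformers versions expect one boolean per batch element.
        return torch.full((input_ids.shape[0],), done, dtype=torch.bool, device=input_ids.device)

# Assumed usage; a plain stopping_criteria kwarg may not reach Bark's sub-models:
# audio_array = model.generate(**inputs, stopping_criteria=StoppingCriteriaList([LengthLoggerCriteria(512)]))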

gante commented 2 weeks ago

@BirgitPohl autoregressive generation is computationally expensive -- depending on the model and your CPU, taking a few minutes is not strange at all. Tagging @Vaibhavs10 here, who might be familiar with BARK/audio strategies for inference on CPU.


A note: it is expected that you see different run times (and results) across different runs. BARK relies on sampling, i.e. its runs are not deterministic and may result in a different number of tokens. Have a look at this guide for the basics of auto-regressive generation -- the principles for LLM or audio generation are the same.
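
If you want comparable timings (and outputs) across runs while debugging, you can pin the RNG; a minimal sketch using transformers' set_seed, reusing the tts object from the script above:

from transformers import set_seed

set_seed(0)  # fixes the RNG so the sampling steps repeat across runs (on the same hardware)
sample_rate, audio_array = tts.synthesize("sample text")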

not-lain commented 1 week ago

@BirgitPohl since you are on a Mac, you can use the mps device. Simply use the following code and you should, in theory, be good to go:

class TextToSpeechService:
    def __init__(self, device: str = "mps" if torch.backends.mps.is_available() else "cpu"):
        # ... rest of __init__ unchanged ...
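
A slightly fuller device-selection sketch, assuming nothing beyond stock PyTorch (the order of preference is my own choice):

import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple's Metal backend (mps), then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

tts = TextToSpeechService(device=pick_device())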

I do not have a Mac and can't debug this, but here are some extra links on how to use your Mac's chip:

Let me know how this goes, and happy coding ✨