audio_array loop in _call_impl() is not capable of generating the example phrase from the README

I discovered Bark today and have been trying to simply generate the test sample from the README, but Bark ties up my resources until I interrupt the process.

All packages installed without issues or warnings. I tried it on an AWS Ubuntu machine using only the CPU and then on a Google Colab Pro account; neither was able to generate the phrase from: inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)

It loops on this line from the code example: audio_array = model.generate(**inputs)

This is the Google Colab status: Executing (5m22s)
<cell line: 10> -> decorate_context() -> generate() -> generate() -> decorate_context() -> generate() -> sample() -> _call_impl() -> forward() -> _call_impl() -> forward() -> _call_impl() -> forward() -> _call_impl() -> forward()
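The repeating _call_impl() -> forward() frames are consistent with autoregressive sampling, where the model's forward pass runs once per generated token, so a long-running generate() may show the same frames over and over even while it is making progress. The toy loop below illustrates only that pattern; toy_forward and toy_sample are hypothetical stand-ins, not Bark's or transformers' actual code.

```python
import numpy as np


def toy_forward(tokens: list[int], vocab_size: int, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for a model's forward(): next-token probabilities.

    A real model would condition on `tokens`; this stub ignores them.
    """
    logits = rng.standard_normal(vocab_size)
    probs = np.exp(logits - logits.max())  # softmax over random logits
    return probs / probs.sum()


def toy_sample(prompt: list[int], max_new_tokens: int, vocab_size: int = 10, seed: int = 0):
    """Sample tokens one at a time, calling forward once per new token."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    forward_calls = 0
    for _ in range(max_new_tokens):
        probs = toy_forward(tokens, vocab_size, rng)  # one forward pass per step
        forward_calls += 1
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens, forward_calls


tokens, calls = toy_sample([1, 2, 3], max_new_tokens=50)
print(len(tokens), calls)
```

Because the number of forward passes grows with the number of tokens generated, total runtime on a CPU scales the same way.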

So for Google Colab you also have to run nltk.download("punkt"). This is for longer text, but you can just replace the script variable with whatever you want (the .replace("\n", " ").strip() is only there to remove the newlines used for formatting). Here's my entire code:

!pip install git+https://github.com/suno-ai/bark.git
!pip install nltk

import os

from IPython.display import Audio
import nltk  # we'll use this to split into sentences
nltk.download('punkt')  # this line was needed for me; it kept throwing an error without it
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

preload_models()

script = """ The environment is the physical or psychological climate where the messaging between the source and receiver is taking place. A room is a common environment where communication takes place. That environment can then include the tables, chairs, lighting, and sound equipment that are in the room. The environment can also include factors, like formal dress, that may indicate whether a discussion is open and caring or more professional and formal.

Context is all about what people expect from each other, and we often create those expectations out of environmental cues. Traditional gatherings like weddings are often formal events. There is a time for quiet social greetings, a time for silence as the bride walks down the aisle, and then also a time for more rambunctious celebration and dancing. You may be called upon to give a toast, and the wedding context will influence your presentation, timing, and effectiveness. So, in a business meeting, who speaks first? That probably has some relation to the position and role each person has outside the meeting. Context plays a very important role in communication, particularly across cultures. """.replace("\n", " ").strip()

sentences = nltk.sent_tokenize(script)
SPEAKER = "v2/en_speaker_3"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
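The long-text workaround above (clean the text, split it into sentences, generate each sentence separately, and join the clips with a quarter second of silence) can be exercised without downloading any models. In this sketch fake_generate_audio is a hypothetical stub standing in for Bark's generate_audio, split_sentences is a cruder regex substitute for nltk.sent_tokenize (so no punkt download is needed, but it will wrongly split after abbreviations like "Dr."), and SAMPLE_RATE = 24_000 is assumed to match Bark's constant.

```python
import re

import numpy as np

SAMPLE_RATE = 24_000  # assumed to match bark.SAMPLE_RATE


def split_sentences(text: str) -> list[str]:
    """Rough stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def fake_generate_audio(sentence: str) -> np.ndarray:
    """Hypothetical stub for bark.generate_audio: 240 samples of zeros per character."""
    return np.zeros(int(0.01 * SAMPLE_RATE) * len(sentence), dtype=np.float32)


def stitch(script: str, pause_s: float = 0.25) -> np.ndarray:
    # Collapse the newlines left by the triple-quoted string into spaces.
    text = script.replace("\n", " ").strip()
    silence = np.zeros(int(pause_s * SAMPLE_RATE), dtype=np.float32)
    pieces = []
    for sentence in split_sentences(text):
        # Generate each sentence on its own, then append a short pause.
        pieces += [fake_generate_audio(sentence), silence.copy()]
    if not pieces:  # np.concatenate rejects an empty list
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)


audio = stitch("""Hello, my dog is cute.
It likes to play fetch!""")
print(len(audio) / SAMPLE_RATE, "seconds")
```

Swapping fake_generate_audio for the real generate_audio(sentence, history_prompt=SPEAKER) gives back the structure of the original script. The silence.copy() mirrors the original code; it is not strictly required, since np.concatenate builds a new array anyway.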

suno-ai/bark, issue #421