suno-ai / bark

🔊 Text-Prompted Generative Audio Model
MIT License

scipy.io.wavfile.write error preventing output #478

Open bagsorbet opened 11 months ago

bagsorbet commented 11 months ago

Hi all,

I'm trying to get bark up and running, and used the example code to see if it's working.

OS is Ubuntu 22.04. I'm running the latest stable release of python3 with PyTorch built for CUDA 12.2. I can provide more details if necessary (I am very inexperienced with these tools, so please pardon me if I have left out a detail needed to diagnose the problem).

Here's what happens when I use the example:

>>> from transformers import pipeline
>>> import scipy
>>> 
>>> synthesiser = pipeline("text-to-speech", "suno/bark-small")
/home/[MY_USER_ACCOUNT]/.local/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
>>> 
>>> speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
>>> 
>>> scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/[MY_USER_ACCOUNT]/.local/lib/python3.10/site-packages/scipy/io/wavfile.py", line 797, in write
    fmt_chunk_data = struct.pack('<HHIIHH', format_tag, channels, fs,
struct.error: 'I' format requires 0 <= number <= 4294967295

Is this related to the deprecated packages? I searched around for this error, and the only thing I found seems entirely unrelated (something about seconds since 1970, but since this is about the format and not time, I am pretty sure that has no bearing whatsoever on my problem).

So, any ideas? :-)

Cazforshort commented 9 months ago

Yes. This was poorly documented.

If you are running in float16, you need to convert your audio array before writing it to the wav file.

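import numpy as np
import scipy.io.wavfile

# speech_output is the generated audio tensor; sampling_rate is the model's sample rate
audio_array = speech_output.cpu().numpy().squeeze()  # move to CPU and drop the batch dimension
audio_array /= 1.414  # scale down for headroom
audio_array *= 32767  # scale up to the int16 range
audio_array = audio_array.astype(np.int16)  # 16-bit PCM, which scipy can write
# print(audio_array)

scipy.io.wavfile.write("bark_out_bet.wav", rate=sampling_rate, data=audio_array)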
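
For the pipeline example in the original post, the same idea can be sketched roughly as below. This assumes speech["audio"] has shape (1, n_samples); scipy.io.wavfile.write reads a 2-D array as (n_samples, n_channels), so an array like that is treated as one sample with thousands of channels, which would be one plausible source of the struct.error. The helper name to_int16_mono and the synthetic sine wave standing in for the model output are made up for illustration, and the sketch clips to [-1, 1] instead of dividing by 1.414.

```python
import numpy as np
import scipy.io.wavfile


def to_int16_mono(audio):
    """Flatten a (1, n_samples) array and convert it to 16-bit PCM.

    Hypothetical helper for illustration, not part of bark or transformers.
    """
    # Upcast first: scipy.io.wavfile.write does not accept float16 data.
    samples = np.asarray(audio, dtype=np.float32).squeeze()
    # Clip to [-1, 1] so the int16 conversion cannot wrap around.
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767).astype(np.int16)


# Stand-in for speech["audio"]: one second of a 440 Hz tone, shape (1, 24000).
sampling_rate = 24000
t = np.arange(sampling_rate) / sampling_rate
fake_audio = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float16)[np.newaxis, :]

pcm = to_int16_mono(fake_audio)  # shape (24000,), dtype int16
scipy.io.wavfile.write("bark_out_sketch.wav", rate=sampling_rate, data=pcm)
```

With the real pipeline output, the call would presumably become to_int16_mono(speech["audio"]) with rate=speech["sampling_rate"].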