danfouer commented 1 year ago

Describe the bug

I am working through the hands-on exercise in Unit 7 of the Hugging Face Audio Course: "taking speech in language X, and translating it to speech in language Y". I used "openai/whisper-large-v2" + "suno/bark-small" to translate speech in language X into Chinese speech. I tested the attached code on Colab with a T4 GPU. However, the error struct.error: ushort format requires 0 <= number <= (0x7fff * 2 + 1) stops me from building a Gradio demo, let alone running it on a Hugging Face Space. Could someone check my Colab code and help me out?

https://colab.research.google.com/drive/1YF30abCKF5ALijjL9ydrVWFq71u5Cy-o?usp=sharing

Have you searched existing issues? 🔎

[X] I have searched and found no existing issues

Reproduction

# -*- coding: utf-8 -*-

"""HFAudio_Unit7_Hands-onexercise.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1YF30abCKF5ALijjL9ydrVWFq71u5Cy-o
"""

!pip install datasets
!pip install --upgrade --quiet pip
!pip install --quiet git+https://github.com/huggingface/transformers.git
!pip install --upgrade accelerate
!pip install gradio

from huggingface_hub import notebook_login
import torch
from transformers import pipeline
from datasets import load_dataset
from IPython.display import Audio
from transformers import BarkModel
from transformers import AutoProcessor
import numpy as np

!nvidia-smi

notebook_login()

"""# Speech translation"""

device = "cuda:0" if torch.cuda.is_available() else "cpu"

device="cpu"

pipe = pipeline( "automatic-speech-recognition", model="openai/whisper-large-v2", device=device )

dataset = load_dataset("facebook/voxpopuli", "en", split="validation", streaming=True)
sample = next(iter(dataset))
print(sample)

Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

def translate(audio):
    outputs = pipe(audio, max_new_tokens=256, generate_kwargs={"task": "transcribe", "language": "chinese"})
    return outputs["text"]

translate(sample["audio"].copy())

"""# Text-to-speech"""

model = BarkModel.from_pretrained("suno/bark-small")
processor = AutoProcessor.from_pretrained("suno/bark")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def synthesise(text_prompt, voice_preset="v2/zh_speaker_1"):
    inputs = processor(text_prompt, voice_preset=voice_preset)
    speech_output = model.generate(**inputs.to(device))
    synthesised_rate = model.generation_config.sample_rate
    return synthesised_rate, speech_output

target_dtype = np.int16
max_range = np.iinfo(target_dtype).max

def speech_to_speech_translation(audio, voice_preset="v2/zh_speaker_1"):
    translated_text = translate(audio)
    synthesised_rate, synthesised_speech = synthesise(translated_text, voice_preset)
    synthesised_speech = (synthesised_speech.cpu().numpy() * max_range).astype(np.int16)
    return synthesised_rate, synthesised_speech

synthesised_rate,synthesised_speech = speech_to_speech_translation(sample["audio"],"v2/zh_speaker_1")

Audio(synthesised_speech, rate=synthesised_rate)

import gradio as gr

demo = gr.Blocks()

mic_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

file_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(source="upload", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

with demo:
    gr.TabbedInterface([mic_translate, file_translate], ["Microphone", "Audio File"])

demo.launch(debug=True)

Screenshot

GradioError

Logs

/usr/local/lib/python3.10/dist-packages/gradio/processing_utils.py:188: UserWarning: Trying to convert audio automatically from int32 to 16-bit int format.
  warnings.warn(warning.format(data.dtype))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 534, in predict
    output = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1563, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1451, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "/usr/local/lib/python3.10/dist-packages/gradio/components/audio.py", line 341, in postprocess
    file_path = self.audio_to_temp_file(
  File "/usr/local/lib/python3.10/dist-packages/gradio/components/base.py", line 335, in audio_to_temp_file
    processing_utils.audio_to_file(sample_rate, data, filename, format=format)
  File "/usr/local/lib/python3.10/dist-packages/gradio/processing_utils.py", line 175, in audio_to_file
    file = audio.export(filename, format=format)
  File "/usr/local/lib/python3.10/dist-packages/pydub/audio_segment.py", line 895, in export
    wave_data.writeframesraw(pcm_for_wav)
  File "/usr/lib/python3.10/wave.py", line 426, in writeframesraw
    self._ensure_header_written(len(data))
  File "/usr/lib/python3.10/wave.py", line 467, in _ensure_header_written
    self._write_header(datasize)
  File "/usr/lib/python3.10/wave.py", line 479, in _write_header
    self._file.write(struct.pack('<L4s4sLHHLLHH4s',
struct.error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)
Exception ignored in: <function Wave_write.__del__ at 0x7fb3cf3b28c0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/wave.py", line 326, in __del__
    self.close()
  File "/usr/lib/python3.10/wave.py", line 444, in close
    self._ensure_header_written(0)
  File "/usr/lib/python3.10/wave.py", line 467, in _ensure_header_written
    self._write_header(datasize)
  File "/usr/lib/python3.10/wave.py", line 479, in _write_header
    self._file.write(struct.pack('<L4s4sLHHLLHH4s',
struct.error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)

System Info

Gradio Environment Information:
------------------------------
Operating System: Linux
gradio version: 3.47.1
gradio_client version: 0.6.0

------------------------------------------------
gradio dependencies in your environment:

aiofiles: 23.2.1
altair: 4.2.2
fastapi: 0.103.2
ffmpy: 0.3.1
gradio-client==0.6.0 is not installed.
httpx: 0.25.0
huggingface-hub: 0.17.3
importlib-resources: 6.1.0
jinja2: 3.1.2
markupsafe: 2.1.3
matplotlib: 3.7.1
numpy: 1.23.5
orjson: 3.9.7
packaging: 23.2
pandas: 1.5.3
pillow: 9.4.0
pydantic: 1.10.13
pydub: 0.25.1
python-multipart: 0.0.6
pyyaml: 6.0.1
requests: 2.31.0
semantic-version: 2.10.0
typing-extensions: 4.5.0
uvicorn: 0.23.2
websockets: 11.0.3
authlib; extra == 'oauth' is not installed.
itsdangerous; extra == 'oauth' is not installed.

gradio_client dependencies in your environment:

fsspec: 2023.6.0
httpx: 0.25.0
huggingface-hub: 0.17.3
packaging: 23.2
requests: 2.31.0
typing-extensions: 4.5.0
websockets: 11.0.3

Severity

Blocking usage of gradio

abidlabs commented 1 year ago

Hi @danfouer this doesn't seem to be a gradio-related issue. The error message you've shown suggests there's an issue when trying to write a .wav file using Python's built-in wave module.

Would you be able to provide us a simpler repro -- perhaps taking the file that is produced by your function directly and seeing if Gradio is able to work with that? As it stands, the repro you've provided is quite complex to get running.
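
One way to shrink the repro, sketched here under assumptions: the traceback ends in the WAV header write, where the format string '<L4s4sLHHLLHH4s' packs the channel count as an unsigned 16-bit value (an H field). The helper write_wav below is hypothetical and uses only the standard library. It assumes the failing call sees the long axis of the audio array as the channel count, which has not been confirmed for this notebook.

```python
import io
import struct
import wave


def write_wav(num_channels, num_frames, sample_rate=24000):
    """Write silent 16-bit PCM to an in-memory WAV file and return its bytes.

    Hypothetical helper: it mimics a writer that takes the channel count
    from one axis of the audio array and the frame count from the other.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(num_channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        # One 16-bit zero per channel per frame.
        wav.writeframes(b"\x00\x00" * num_channels * num_frames)
    return buf.getvalue()


# Three seconds of mono audio at 24 kHz: 1 channel, 72000 frames.
good = write_wav(num_channels=1, num_frames=72000)
print(good[:4])  # RIFF magic bytes at the start of the file

# The same data with the axes swapped: 72000 "channels", 1 frame.
# 72000 does not fit in an unsigned 16-bit header field.
try:
    write_wav(num_channels=72000, num_frames=1)
except struct.error as exc:
    print("struct.error:", exc)
```

If the second call fails in the same place as the traceback, the shape of the array returned by the function is the first thing to check.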

danfouer commented 1 year ago

Hi @abidlabs, thanks for your reply. A friend helped me figure it out. He found that the dimensions of my speech array are reversed somewhere before or inside wave.py: the array is (num_channels, num_samples) but it should be (num_samples, num_channels). See: https://stackoverflow.com/questions/40822877/scipy-io-cant-write-wavfile

Using speech_to_speech_translation_fix, which transposes the output of the original function, seems to solve my problem:

def speech_to_speech_translation_fix(audio, voice_preset="v2/zh_speaker_1"):
    synthesised_rate, synthesised_speech = speech_to_speech_translation(audio, voice_preset)
    return synthesised_rate, synthesised_speech.T

You can find my application here: https://huggingface.co/spaces/zongxiao/speech-to-speech
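
The transpose can be made a little more defensive. The sketch below uses a hypothetical helper, to_samples_first, and assumes the consumer wants audio shaped either (num_samples,) or (num_samples, num_channels). The "fewer channels than samples" check is a heuristic chosen for illustration, not documented behaviour of Gradio or Bark.

```python
import numpy as np


def to_samples_first(speech):
    """Return audio shaped (num_samples,) or (num_samples, num_channels).

    Hypothetical helper. Heuristic: real audio has far fewer channels than
    samples, so a 2-D array whose first axis is the shorter one is treated
    as (num_channels, num_samples) and transposed.
    """
    speech = np.asarray(speech)
    if speech.ndim == 2 and speech.shape[0] < speech.shape[1]:
        speech = speech.T
    # Collapse a single-channel 2-D array to plain 1-D mono.
    if speech.ndim == 2 and speech.shape[1] == 1:
        speech = speech[:, 0]
    return speech


# A batch-of-one model output: 1 row, 72000 samples.
model_output = np.zeros((1, 72000), dtype=np.int16)
print(to_samples_first(model_output).shape)  # (72000,)
```

Arrays that are already 1-D, or already samples-first, pass through unchanged, so the helper can wrap the return value of speech_to_speech_translation without a separate code path.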

By the way, can Gradio produce two outputs from one input? For example, I would like to output the text from the translate function as well as the audio from the speech_to_speech_translation_fix function. Can Gradio do this?
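
On the two-outputs question, a minimal sketch, assuming gr.Interface accepts a list of output components and maps a returned tuple onto them in order. translate_stub and synthesise_stub are hypothetical placeholders for the translate and synthesise functions in this thread, and gradio is imported lazily inside build_demo so the plain function can be checked without it.

```python
import numpy as np


def translate_stub(audio_path):
    # Hypothetical placeholder for translate() from the notebook.
    return f"translated text for {audio_path}"


def synthesise_stub(text, sample_rate=24000):
    # Hypothetical placeholder for synthesise(): one second of silence.
    return sample_rate, np.zeros(sample_rate, dtype=np.int16)


def translate_and_speak(audio_path):
    """Return one value per output component: text first, then audio."""
    text = translate_stub(audio_path)
    rate, speech = synthesise_stub(text)
    return text, (rate, speech)


def build_demo():
    import gradio as gr  # imported here so the functions above run without gradio

    return gr.Interface(
        fn=translate_and_speak,
        inputs=gr.Audio(type="filepath"),
        outputs=[
            gr.Textbox(label="Translated text"),
            gr.Audio(label="Generated speech", type="numpy"),
        ],
    )
```

Calling build_demo().launch() would start the interface. The order of the returned tuple has to match the order of the outputs list.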