Closed danfouer closed 1 year ago
Hi @danfouer this doesn't seem to be a gradio-related issue. The error message you've shown suggests there's an issue when trying to write a .wav file using Python's built-in wave module.
Would you be able to provide us a simpler repro -- perhaps taking the file that is produced by your function directly and seeing if Gradio is able to work with that? As it stands, the repro you've provided is quite complex to get running.
Hi @abidlabs, thanks for your reply. A friend helped me figure it out. He found that the dimensions of my speech array are reversed before, or inside, wave.py: the array is (num_channels, num_samples) but it should be (num_samples, num_channels). See: https://stackoverflow.com/questions/40822877/scipy-io-cant-write-wavfile
Using speech_to_speech_translation_fix may solve my problem:
def speech_to_speech_translation_fix(audio, voice_preset="v2/zh_speaker_1"):
    synthesised_rate, synthesised_speech = speech_to_speech_translation(audio, voice_preset)
    return synthesised_rate, synthesised_speech.T
You can find my application here: https://huggingface.co/spaces/zongxiao/speech-to-speech
By the way, can Gradio take one input and produce two outputs? For example, I would also like to output the text from the translate function and then the audio from the speech_to_speech_translation_fix function. Can Gradio do this for me?
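The shape fix described above can be sketched in isolation. This is a minimal sketch, not the code from the Space: `to_samples_first` is a hypothetical helper, and its "more samples than channels" check is a heuristic assumption that holds for ordinary clips but not for arrays shorter than their channel count.

```python
import numpy as np


def to_samples_first(audio: np.ndarray) -> np.ndarray:
    """Return audio shaped (num_samples, num_channels); 1-D mono is left as-is.

    Hypothetical helper for illustration only.
    """
    audio = np.asarray(audio)
    if audio.ndim == 1:
        return audio
    if audio.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got {audio.ndim}-D")
    # Heuristic (assumption): a real clip has far more samples than channels,
    # so a short first axis means the array is (num_channels, num_samples).
    if audio.shape[0] < audio.shape[1]:
        audio = audio.T
    return audio


# A batch-first array such as a TTS model might return: 1 row, 24000 samples.
batch_first = np.zeros((1, 24000), dtype=np.int16)
fixed = to_samples_first(batch_first)
print(batch_first.shape, "->", fixed.shape)
```

Calling `.T` directly, as in speech_to_speech_translation_fix, is equivalent whenever the input is known to be (num_channels, num_samples).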
Ok great! I'll go ahead and close as this is not a gradio-related issue
By the way, can Gradio take one input and produce two outputs? For example, I would also like to output the text from the translate function and then the audio from the speech_to_speech_translation_fix function. Can Gradio do this for me?
Yes, this is totally possible. See https://www.gradio.app/guides/quickstart#multiple-input-and-output-components
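As a rough sketch of what that could look like for this demo: a function that returns two values can be paired with a list of two output components. The `translate` and `synthesise` functions below are stand-ins (assumptions) that replace the Whisper and Bark calls so the example runs on its own, and `text_and_speech` and `build_demo` are hypothetical names.

```python
import numpy as np


def translate(audio):
    # Stand-in (assumption) for the Whisper-based translate() in the issue.
    return "placeholder text"


def synthesise(text):
    # Stand-in (assumption) for the Bark-based synthesise(): returns a
    # sampling rate and a float array shaped (num_channels, num_samples).
    return 24000, np.zeros((1, 24000), dtype=np.float32)


def text_and_speech(audio):
    """Return two values: the translated text and a (rate, samples) pair."""
    text = translate(audio)
    rate, speech = synthesise(text)
    pcm = (speech * np.iinfo(np.int16).max).astype(np.int16)
    # Transpose to (num_samples, num_channels), as in the fix from the thread.
    return text, (rate, pcm.T)


def build_demo():
    # Gradio is imported lazily so the functions above can be tried without it.
    import gradio as gr

    return gr.Interface(
        fn=text_and_speech,
        inputs=gr.Audio(type="filepath"),
        # One component per returned value, in the same order.
        outputs=[
            gr.Textbox(label="Translated text"),
            gr.Audio(label="Generated speech", type="numpy"),
        ],
    )


# build_demo().launch()  # uncomment to start the app
```

The order of the components in `outputs` has to match the order of the values returned by the function.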
Describe the bug
I am working on the Unit 7 hands-on task of the Hugging Face Audio course: "taking speech in language X, and translating it to speech in language Y". I used "openai/whisper-large-v2" + "suno/bark-small" to translate language X into Chinese. I have tested my code on Colab, as attached, using a T4 GPU. However, the error struct.error: ushort format requires 0 <= number <= (0x7fff * 2 + 1) stops me from building a Gradio demo, let alone running it on a Hugging Face Space. Could someone check my Colab code and help me out?
https://colab.research.google.com/drive/1YF30abCKF5ALijjL9ydrVWFq71u5Cy-o?usp=sharing
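One plausible reading of the error message, consistent with the fix reported in the thread: the WAV header stores the channel count in an unsigned 16-bit field, so if an array shaped (1, num_samples) is interpreted as (num_samples, num_channels), the "channel count" becomes the number of samples and no longer fits. The sketch below only demonstrates the 16-bit limit with the standard library; `fits_in_ushort` is a hypothetical helper and the value 240000 is an illustrative sample count, not one taken from the notebook.

```python
import struct

USHORT_MAX = 0xFFFF  # 65535, the same value as 0x7fff * 2 + 1 in the message


def fits_in_ushort(value: int) -> bool:
    """Return True if value can be packed as an unsigned 16-bit integer."""
    try:
        struct.pack("<H", value)  # "<H" = little-endian unsigned short
        return True
    except struct.error:
        return False


print(fits_in_ushort(1))       # a real channel count
print(fits_in_ushort(240000))  # a sample count mistaken for a channel count
```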
Have you searched existing issues?
Reproduction
# -*- coding: utf-8 -*-
"""HFAudio_Unit7_Hands-onexercise.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1YF30abCKF5ALijjL9ydrVWFq71u5Cy-o
"""
!pip install datasets
!pip install --upgrade --quiet pip
!pip install --quiet git+https://github.com/huggingface/transformers.git
!pip install --upgrade accelerate
!pip install gradio

from huggingface_hub import notebook_login
import torch
from transformers import pipeline
from datasets import load_dataset
from IPython.display import Audio
from transformers import BarkModel
from transformers import AutoProcessor
import numpy as np
!nvidia-smi
notebook_login()
"""# Speech translation"""
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device="cpu"
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-large-v2", device=device
)

dataset = load_dataset("facebook/voxpopuli", "en", split="validation", streaming=True)
sample = next(iter(dataset))
print(sample)
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])
def translate(audio):
    outputs = pipe(audio, max_new_tokens=256, generate_kwargs={"task": "transcribe", "language": "chinese"})
    return outputs["text"]
translate(sample["audio"].copy())
"""# Text-to-speech"""
model = BarkModel.from_pretrained("suno/bark-small")
processor = AutoProcessor.from_pretrained("suno/bark")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def synthesise(text_prompt, voice_preset="v2/zh_speaker_1"):
    inputs = processor(text_prompt, voice_preset=voice_preset)
    speech_output = model.generate(**inputs.to(device))
    synthesised_rate = model.generation_config.sample_rate
    return synthesised_rate, speech_output

target_dtype = np.int16
max_range = np.iinfo(target_dtype).max

def speech_to_speech_translation(audio, voice_preset="v2/zh_speaker_1"):
    translated_text = translate(audio)
    synthesised_rate, synthesised_speech = synthesise(translated_text, voice_preset)
    synthesised_speech = (synthesised_speech.cpu().numpy() * max_range).astype(np.int16)
    return synthesised_rate, synthesised_speech
synthesised_rate,synthesised_speech = speech_to_speech_translation(sample["audio"],"v2/zh_speaker_1")
Audio(synthesised_speech, rate=synthesised_rate)
import gradio as gr
demo = gr.Blocks()
mic_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

file_translate = gr.Interface(
    fn=speech_to_speech_translation,
    inputs=gr.Audio(source="upload", type="filepath"),
    outputs=gr.Audio(label="Generated Speech", type="numpy"),
)

with demo:
    gr.TabbedInterface([mic_translate, file_translate], ["Microphone", "Audio File"])
demo.launch(debug=True)
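The float-to-int16 conversion used in speech_to_speech_translation above can also be tried on its own. The sketch below is a hedged variant, not the notebook's code: `float_to_int16` is a hypothetical helper, the input is assumed to be float audio nominally in the range [-1, 1], and the clipping step is an addition that the reproduction does not perform.

```python
import numpy as np


def float_to_int16(speech: np.ndarray) -> np.ndarray:
    """Scale float audio (assumed to lie in [-1, 1]) to 16-bit PCM."""
    max_range = np.iinfo(np.int16).max
    # Clipping is an addition to the original code: it keeps out-of-range
    # values from wrapping around when they are cast to int16.
    clipped = np.clip(speech, -1.0, 1.0)
    return (clipped * max_range).astype(np.int16)


# Shaped (num_channels, num_samples), like the array in the reproduction.
wave = np.array([[0.0, 0.5, 1.0, -1.0, 1.5]], dtype=np.float32)
pcm = float_to_int16(wave)
print(pcm.dtype, pcm.shape)
```

The result keeps the input's (num_channels, num_samples) layout, so it would still need the transpose discussed in the thread before being handed to gr.Audio.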
Severity
Blocking usage of gradio