gradio-app / gradio

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
http://www.gradio.app
Apache License 2.0

Audio streaming issue for iphone #6835

Closed sumeetmahesh closed 3 weeks ago

sumeetmahesh commented 9 months ago

Describe the bug

When a generator function (one that uses yield) passes chunks of audio to a Gradio audio output in streaming mode, playback works fine on desktop (Windows) and on Android phones. On iPhone, however, playback fails with an error status.

If the code instead waits until the full audio has been created before passing it to the Gradio audio output, playback works fine on iPhone. This approach is undesirable, though, because it lowers service quality: the user waits longer before any audio plays.

Have you searched existing issues? 🔎

Reproduction

!pip install --quiet transformers datasets accelerate==0.20.3 gradio SentencePiece

from transformers import pipeline
import torch
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

device = "cuda:0" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base.en", device=device
)

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

from datasets import load_dataset

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

def synthesise(text):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(
        inputs["input_ids"].to(device), speaker_embeddings.to(device), vocoder=vocoder
    )
    return speech.cpu()

import numpy as np

target_dtype = np.int16
max_range = np.iinfo(target_dtype).max

def speech_to_speech_assist(audio):
    transcribe_text = transcriber(audio, generate_kwargs={"max_new_tokens": 1024})["text"]
    print(f"Transcribed text: {transcribe_text}")

    # Streaming path that triggers the issue: yield audio in five-word chunks.
    # Comment out this block to test the non-streaming path below.
    temp = ""
    for i, new_text in enumerate(transcribe_text.split()):
        # Keep a space between words so the synthesiser sees separate words.
        temp += new_text + " "
        if (i + 1) % 5 == 0:
            synthesised_speech = synthesise(temp)
            synthesised_speech = (synthesised_speech.numpy() * max_range).astype(np.int16)
            yield 16000, synthesised_speech
            temp = ""
    # Synthesise whatever words remain after the last full chunk.
    synthesised_speech = synthesise(temp)
    synthesised_speech = (synthesised_speech.numpy() * max_range).astype(np.int16)
    yield 16000, synthesised_speech

    # Non-streaming path for debugging: uncomment the code below.
    # synthesised_speech = synthesise(transcribe_text)
    # synthesised_speech = (synthesised_speech.numpy() * max_range).astype(np.int16)
    # return 16000, synthesised_speech

import gradio as gr

with gr.Blocks() as demo:
  input_audio = gr.Audio(sources=["microphone"], type="filepath")
  output_audio = gr.Audio(type="numpy", autoplay=True, streaming=True)
  input_audio.stop_recording(speech_to_speech_assist, input_audio, output_audio).then(lambda:None, None, input_audio, queue=False)

if __name__ == "__main__":
    demo.launch(share=True, debug=True)
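
The five-word chunking used in the reproduction can be exercised without the models or Gradio. The following is a minimal sketch, not the reporter's code: `chunk_words` is a hypothetical helper name, and unlike the loop above it never produces an empty final chunk when the word count is an exact multiple of the chunk size.

```python
from typing import Iterator


def chunk_words(text: str, words_per_chunk: int = 5) -> Iterator[str]:
    """Yield space-joined groups of up to `words_per_chunk` words from `text`.

    Hypothetical helper for illustration; it only isolates the text
    chunking step and does no speech synthesis.
    """
    words = text.split()
    for start in range(0, len(words), words_per_chunk):
        # Slice ends are exclusive, so consecutive chunks never overlap.
        yield " ".join(words[start:start + words_per_chunk])


if __name__ == "__main__":
    for chunk in chunk_words("one two three four five six seven"):
        print(chunk)
```

Each yielded string could then be passed to `synthesise` in place of `temp`, assuming the rest of the reproduction is unchanged.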

Screenshot

No response

Logs

No response

System Info

Gradio Environment Information:
------------------------------
Operating System: Linux
gradio version: 4.10.0
gradio_client version: 0.7.3

------------------------------------------------
gradio dependencies in your environment:

aiofiles: 23.2.1
altair: 4.2.2
fastapi: 0.105.0
ffmpy: 0.3.1
gradio-client==0.7.3 is not installed.
httpx: 0.25.2
huggingface-hub: 0.19.4
importlib-resources: 6.1.1
jinja2: 3.1.2
markupsafe: 2.1.3
matplotlib: 3.7.1
numpy: 1.23.5
orjson: 3.9.10
packaging: 23.2
pandas: 1.5.3
pillow: 9.4.0
pydantic: 2.5.2
pydub: 0.25.1
python-multipart: 0.0.6
pyyaml: 6.0.1
semantic-version: 2.10.0
tomlkit==0.12.0 is not installed.
typer: 0.9.0
typing-extensions: 4.9.0
uvicorn: 0.24.0.post1
authlib; extra == 'oauth' is not installed.
itsdangerous; extra == 'oauth' is not installed.

gradio_client dependencies in your environment:

fsspec: 2023.6.0
httpx: 0.25.2
huggingface-hub: 0.19.4
packaging: 23.2
typing-extensions: 4.9.0
websockets: 11.0.3

Severity

Blocking usage of gradio

bigr00 commented 8 months ago

I’m developing software for clients that showcase it at conferences and we use iPads to run Gradio apps. Right now audio streaming is broken for all iOS devices - it says “Error” and draws a yellow box around the component. This happens in Safari and Chrome.

Here is the “stream_audio_out” demo notebook run on an iPad: IMG_0302

The impact is huge. I am considering moving away from Gradio, because I had to engineer a workaround: segment the audio into multiple files, trigger the initial start of the stream with a hidden component (changing its default value once), and then load a new file from the Audio.stop() event every time one file finishes.

Example of what currently needs to be done to stream audio on iOS devices:

with gr.Blocks() as app:
    audio_input = gr.Audio(label="Record or upload Audio", type="filepath", interactive=True, format='mp3')
    hidden_textbox = gr.Textbox(value="someValue", visible=False)
    start_button = gr.Button("Start", elem_id="start_btn", variant="primary")

    audio_output = gr.Audio(label="Output Audio", type="filepath", autoplay=True)
    text_output = gr.Textbox(label="Output text")

    # Returning an empty string so it changes the value of the hidden textbox only once.
    text_output.change(fn=text_changed, inputs=[text_output], outputs=[hidden_textbox])
    start_button.click(fn=process_input, inputs=[audio_input], outputs=[text_output])

    # This triggers the initial audio start
    hidden_textbox.change(fn=start_audio, outputs=[audio_output])

    # This loads the next audio segment when the previous is finished
    audio_output.stop(fn=on_finished_playing_audio, inputs=[], outputs=[audio_output])

I can see no errors in the JS console or in the Python output.
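
The workaround above references handlers (`start_audio`, `on_finished_playing_audio`) whose bodies are not shown in the comment. One way they might look is sketched below, purely as an assumption: a module-level queue of segment file paths, where `segment_queue` and `enqueue_segments` are hypothetical names, and returning `None` is assumed to leave the audio component empty once the queue runs out.

```python
from collections import deque
from typing import Deque, Iterable, Optional

# Hypothetical shared queue of segment file paths, filled by the
# processing step (process_input in the workaround above).
segment_queue: Deque[str] = deque()


def enqueue_segments(paths: Iterable[str]) -> None:
    """Add freshly written segment files to the playback queue."""
    segment_queue.extend(paths)


def next_segment() -> Optional[str]:
    """Pop the next segment path, or return None when nothing is queued."""
    return segment_queue.popleft() if segment_queue else None


def start_audio() -> Optional[str]:
    """Assumed handler for hidden_textbox.change: start the first segment."""
    return next_segment()


def on_finished_playing_audio() -> Optional[str]:
    """Assumed handler for audio_output.stop: load the following segment."""
    return next_segment()
```

A single global queue only suits a single-user demo; a multi-user app would need per-session state instead.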

abidlabs commented 2 months ago

I believe this should have been fixed with https://github.com/gradio-app/gradio/issues/6835, @sumeetmahesh if you'd like to confirm.

freddyaboulton commented 2 months ago

Forgot to comment about this in #8906 - I think iPhone Safari does not like autoplay (https://stackoverflow.com/questions/43570460/html5-video-autoplay-on-iphone), but if you manually click play, you will hear the audio as soon as it's ready. Looking into it.

sumeetmahesh commented 2 months ago

thanks @freddyaboulton !

Now the yield function that passes chunks of audio to the Gradio audio output in streaming mode works fine on iPhone and Android phones. However, playback now fails on desktop (Chrome and Edge), with no error status.

import numpy as np
import gradio as gr
import librosa
import soundfile as sf

SEGMENT_DURATION_SEC = 2

def splitter(audio):
    print(audio)
    wave, sr = librosa.load(audio, sr=None)
    duration = librosa.get_duration(y=wave, sr=sr)

    # Note: recordings of SEGMENT_DURATION_SEC or less yield nothing.
    if duration > SEGMENT_DURATION_SEC:
        segment_length = sr * SEGMENT_DURATION_SEC
        num_sections = int(np.ceil(len(wave) / segment_length))
        split = []
        print(f'Number of Sections:{num_sections}')
        for i in range(num_sections):
            # Slice ends are exclusive, so no "- 1" is needed; subtracting 1
            # would drop the last sample of every segment.
            t = wave[i * segment_length: (i + 1) * segment_length]
            split.append(t)

        # Write each segment to its own WAV file and stream the path.
        for i in range(num_sections):
            audio_out = f"{audio[:-4]}{i}_out.wav"
            sf.write(audio_out, split[i], sr)
            yield audio_out

gr.Interface(splitter,
             gr.Audio(sources=["microphone"], type="filepath"),
             gr.Audio(streaming=True, autoplay=True, type="filepath")
).launch(share=True, debug=True)
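
The segmentation step in this snippet can be checked in isolation, without librosa or Gradio. The sketch below is illustrative only: `split_samples` is a hypothetical helper, and it relies on Python slice ends being exclusive, so consecutive slices `[i:i + n]` cover every sample exactly once.

```python
from typing import List, Sequence


def split_samples(samples: Sequence[float], segment_length: int) -> List[Sequence[float]]:
    """Split `samples` into consecutive segments of `segment_length` items.

    Hypothetical helper for illustration. The last segment may be shorter,
    and no sample is dropped or duplicated.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        samples[start:start + segment_length]
        for start in range(0, len(samples), segment_length)
    ]


if __name__ == "__main__":
    segments = split_samples(list(range(10)), 4)
    print([len(segment) for segment in segments])
```

With a sample rate `sr`, a two-second segment corresponds to `segment_length = sr * 2`, as in the snippet above.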
freddyaboulton commented 1 month ago

Hi @sumeetmahesh - sorry for the delay. I am not sure why it would work on iPhone but not elsewhere.

abidlabs commented 3 weeks ago

Going to close this as I don't think we have a clear path to repro. If we have the code, environment, and audio files needed to repro this issue, we can reopen