gradio-app / gradio

Audio processing requires submitting twice #6924

Open Tejaswgupta opened 5 months ago

Tejaswgupta commented 5 months ago

Describe the bug

When submitting audio recorded from the browser, clicking Submit for processing returns no text; clicking it a second time produces the transcription. I'm not sure whether it's related to my code or to Gradio (I saw the same issue on the SeamlessM4T Space).

Have you searched existing issues? 🔎

Reproduction

import os
import ffmpeg
import gradio as gr
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.audio_utils import ffmpeg_read
from transformers.utils import is_flash_attn_2_available

# use CUDA when flash-attention 2 is available, otherwise fall back to Apple MPS
device = 'cuda' if is_flash_attn_2_available() else 'mps'

current_directory = os.path.dirname(__file__) + '/'

hindi_model_name = current_directory + 'distil-whisper/training/hindi_models/whisper-large-hi-noldcil/'
hindi_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    hindi_model_name,
    torch_dtype=torch.float16,
    use_safetensors=True
)
hindi_model.to(device)

hindi_processor = AutoProcessor.from_pretrained(
    hindi_model_name,
)

hindi_model_pipeline = pipeline(
    "automatic-speech-recognition",
    model=hindi_model,
    tokenizer=hindi_processor.tokenizer,
    feature_extractor=hindi_processor.feature_extractor,
    torch_dtype=torch.float16,
    device=device,
    model_kwargs={"use_flash_attention_2": is_flash_attn_2_available()},
    batch_size=24,
    max_new_tokens=80,
    chunk_length_s=25
)

english_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    'distil-whisper/distil-large-v2',
    torch_dtype=torch.float16,
    use_safetensors=True
)

english_model.to(device)

english_processor = AutoProcessor.from_pretrained(
    'distil-whisper/distil-large-v2',
)

english_model_pipeline = pipeline(
    "automatic-speech-recognition",
    model=english_model,
    torch_dtype=torch.float16,
    tokenizer=english_processor.tokenizer,
    feature_extractor=english_processor.feature_extractor,
    device=device,
    model_kwargs={"use_flash_attention_2": is_flash_attn_2_available()},
    batch_size=24,
    max_new_tokens=80,
    chunk_length_s=25
)

def transcribe(audio: str, language):
    if audio is not None:
        if not audio.endswith('.wav'):
            output_file = 'output.wav'
            sample_rate = 16000
            # overwrite any previous output.wav instead of blocking on ffmpeg's prompt
            ffmpeg.input(audio).output(output_file, ar=sample_rate).run(overwrite_output=True)
            audio = output_file

        pipe = hindi_model_pipeline if language == "Hindi" else english_model_pipeline
        text = pipe(audio)["text"]
        return text
    else:
        return

demo = gr.Interface(
    transcribe,
    [gr.Audio(type="filepath"), gr.Dropdown(
        choices=['Hindi', 'English'], label="Select Language", value='English'),
     ],
    "text",
)

# demo.launch(share=True,root_path="/lammar_nginx")
demo.launch(share=False, server_name="0.0.0.0", ssl_verify=False)

Screenshot

No response

Logs

No response

System Info

Gradio Environment Information:
------------------------------
Operating System: Linux
gradio version: 3.46.0
gradio_client version: 0.5.3

------------------------------------------------
gradio dependencies in your environment:

aiofiles: 23.2.1
altair: 5.1.1
fastapi: 0.100.1
ffmpy: 0.3.1
gradio-client==0.5.3 is not installed.
httpx: 0.24.1
huggingface-hub: 0.19.4
importlib-resources: 6.1.0
jinja2: 3.1.2
markupsafe: 2.1.3
matplotlib: 3.7.3
numpy: 1.24.4
orjson: 3.9.7
packaging: 23.2
pandas: 2.0.3
pillow: 10.1.0
pydantic: 1.10.13
pydub: 0.25.1
python-multipart: 0.0.6
pyyaml: 5.3.1
requests: 2.31.0
semantic-version: 2.10.0
typing-extensions: 4.8.0
uvicorn: 0.23.2
websockets: 11.0.3
authlib; extra == 'oauth' is not installed.
itsdangerous; extra == 'oauth' is not installed.

gradio_client dependencies in your environment:

fsspec: 2023.6.0
httpx: 0.24.1
huggingface-hub: 0.19.4
packaging: 23.2
requests: 2.31.0
typing-extensions: 4.8.0
websockets: 11.0.3

Severity

I can work around it

NeonDaniel commented 1 month ago

I have the same issue after updating from ~=3.28 to ~=4.31. I can see that None is passed the first time and the valid file path the second time. I do not see a valid workaround.

        with self.chat_ui as blocks:
            client_session = gradio.State(self._start_session())
            client_session.attach_load_event(self._start_session, None)
            # Define primary UI
            blocks.title = title
            chatbot = gradio.Chatbot(label=chatbot_label)
            with gradio.Row():
                textbox = gradio.Textbox(label=text_label,
                                         placeholder=placeholder,
                                         scale=7)
                audio_input = gradio.Audio(sources=["microphone"],
                                           type="filepath",
                                           label=speech,
                                           editable=False,
                                           scale=3)
                submit = gradio.Button(value="Submit",
                                       variant="primary")
            LOG.debug("Created input elements")
            tts_audio = gradio.Audio(autoplay=True, visible=False)
            LOG.debug("Created audio element")
            submit.click(self.on_user_input,
                         inputs=[textbox, chatbot, audio_input,
                                 client_session],
                         outputs=[chatbot, client_session, textbox,
                                  audio_input, tts_audio])

freddyaboulton commented 1 month ago

This is probably related to the fact that the audio is still being uploaded to the server when you click submit. So if you click submit before the file has finished uploading, the value passed to your function will be None.
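
If that is the cause, a handler can at least fail visibly instead of silently returning empty text. A minimal sketch based on the transcribe function from the reproduction above (the error message is illustrative):

import gradio as gr

def transcribe(audio: str, language: str) -> str:
    # Gradio passes None when submit is clicked before the recording
    # has finished uploading; surface that instead of returning nothing.
    if audio is None:
        raise gr.Error("Audio has not finished uploading - please submit again.")
    # ... rest of the original function unchanged ...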