m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

OAI Whisper transcribes correctly but whisperx returns `No active speech found in audio` #844

Open reasv opened 1 month ago

reasv commented 1 month ago

I'm getting poor transcription results with whisperx. Specifically, for some short videos I get no transcription at all, whereas OpenAI's official Whisper model transcribes them correctly.

On the OpenAI side, I am using their official HF Space (https://huggingface.co/spaces/openai/whisper) which employs large-v3.

On the whisperx side, I am using Systran/faster-whisper-large-v3 for comparison, with the latest whisperx from GitHub and PyTorch with CUDA on Windows 11 (on an RTX 4090).

Here's the code for the simple gradio UI I use for testing whisperx: https://github.com/reasv/panoptikon/blob/master/src/ui/test_models/whisper.py

The transcription function is very simple:

from typing import Tuple

import numpy as np


def transcribe_audio(
    model_repo: str | None,
    language: str | None,
    batch_size: int,
    audio_tuple: Tuple[int, np.ndarray] | None,
    audio_file: str | None,
) -> Tuple[str, Tuple[int, np.ndarray] | None]:
    if model_repo is None:
        return "[No model selected]", None
    print(
        f"""
        Transcribing audio with model: {model_repo} \
        and language: {language}
        """
    )

    import torch
    import whisperx

    sample_rate, audio = (
        audio_tuple if audio_tuple is not None else (None, None)
    )

    # A bare `if audio:` raises ValueError for multi-element numpy arrays
    if audio is not None:
        print(f"Sample rate: {sample_rate}")

    if audio is None and audio_file is not None:
        audio = whisperx.load_audio(audio_file)

    if audio is None:
        return "[No audio provided]", None

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"

    whisper_model = whisperx.load_model(
        model_repo,
        device=device,
        language=language,
    )

    result = whisper_model.transcribe(
        audio,
        batch_size=batch_size,
        language=language,
    )
    print(result)
    merged_text = "\n".join([segment["text"] for segment in result["segments"]])
    return merged_text, (whisperx.audio.SAMPLE_RATE, audio)

I am only testing it with audio file paths at the moment, so assume audio_file is populated, and not audio_tuple. The audio seems to be loaded correctly from the video file since I can listen to the extracted audio output to the gradio Audio component.

This is the output I get:

Transcribing audio with model: Systran/faster-whisper-large-v3 and language: en

Q:\projects\panoptikon\.venv\Lib\site-packages\pyannote\audio\core\io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\[]\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.1+cu121. Bad things might happen unless you revert torch to 1.x.
No active speech found in audio
{'segments': [], 'language': 'en'}

Some videos work as expected; for others I get `No active speech found in audio` even though the speech seems relatively clear (and is in English).

At the moment I cannot give an example of a video that causes this problem as it's happening with personal videos and I haven't found publicly available videos that reproduce the issue yet.

Any ideas of why this is happening?

reasv commented 1 month ago

I have noticed a pattern: In a video that has this issue, there is a first part of clear speech, and then loud noises. If I cut the last part with loud noises out, the transcription works correctly.

So, for some reason later noise prevents earlier speech from being transcribed? Still, OpenAI Whisper transcribes everything correctly, including speech during the part with loud noises.
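The "cut the noisy tail off" check described above can be reproduced in code rather than with a video editor. This is a minimal sketch, not part of the original report: `keep_first_seconds` is a hypothetical helper, and the 16000 Hz default assumes the waveform came from `whisperx.load_audio`, which is expected to match `whisperx.audio.SAMPLE_RATE`.

```python
def keep_first_seconds(audio, seconds, sample_rate=16000):
    """Return only the first `seconds` of a mono waveform.

    Works on any sliceable sequence (list or 1-D numpy array).
    `sample_rate` is an assumption: 16000 Hz, the rate Whisper models expect.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    # Convert the duration to a sample count and slice from the start.
    return audio[: int(seconds * sample_rate)]
```

Transcribing `keep_first_seconds(audio, 5)` and the full `audio` with the same model, then comparing the two results, shows whether the trailing noise is what suppresses the earlier speech.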

reasv commented 1 month ago

I have the same problem when using the new batched backend for faster_whisper, so perhaps batching is at fault here. This happens even though the problem videos are too short to need batching (e.g., 8 s).

BBC-Esq commented 1 month ago

> I have noticed a pattern: In a video that has this issue, there is a first part of clear speech, and then loud noises. If I cut the last part with loud noises out, the transcription works correctly.
>
> So, for some reason later noise prevents earlier speech from being transcribed? Still, OpenAI Whisper transcribes everything correctly, including speech during the part with loud noises.

Good delving into the issue, thanks.

reasv commented 1 month ago

As I mentioned in my comment on the faster_whisper PR, I have the same problem when enabling batching on faster_whisper, but the issue disappears when not using the batched pipeline (on faster_whisper) https://github.com/SYSTRAN/faster-whisper/pull/856
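One way to act on this observation is to fall back to the non-batched faster-whisper path whenever the batched pipeline returns no segments. The following is a rough sketch under stated assumptions: `has_no_speech` and `transcribe_unbatched` are hypothetical helper names, the model name and `device="auto"` are placeholder choices, and the result dict shape is taken from the `{'segments': [], 'language': 'en'}` output shown earlier in this thread.

```python
from typing import Any, Dict


def has_no_speech(result: Dict[str, Any]) -> bool:
    """True when a whisperx-style result dict contains no segments."""
    return not result.get("segments")


def transcribe_unbatched(
    audio_path: str, model_name: str = "large-v3", language: str = "en"
) -> str:
    """Transcribe with faster-whisper's regular (non-batched) pipeline."""
    # Imported lazily so has_no_speech works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device="auto")
    # transcribe() returns a lazy generator of segments plus an info object.
    segments, _info = model.transcribe(audio_path, language=language)
    return "\n".join(segment.text for segment in segments)
```

A caller would run the batched transcription first and call `transcribe_unbatched(audio_file)` only when `has_no_speech(result)` is true, at the cost of loading a second model.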

reasv commented 1 month ago

I found a video on the internet that replicates this problem. Audio: https://litter.catbox.moe/kyu2q8.wav

MahmoudAshraf97 commented 1 month ago

@reasv can you reupload the video to permanent storage and share the link?

ncuxzy commented 2 weeks ago

> @reasv can you reupload the video to permanent storage and share the link?

https://drive.google.com/file/d/1JKsYQZYQDrKuRFciFhh1aA5ftAGr-eud/view?usp=sharing

This video replicates the problem.

seanco-hash commented 1 week ago

Hi, has anyone found a solution? I noticed that the old version of the medium model works fine, but the new versions of medium and large-v3 have these problems.

MahmoudAshraf97 commented 1 week ago

> Hi, has anyone found a solution? I noticed that the old version of the medium model works fine, but the new versions of medium and large-v3 have these problems.

The problem is with the pyannote VAD model. https://github.com/SYSTRAN/faster-whisper/pull/936 is a possible solution, but you have to use faster-whisper for transcription.
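For readers unfamiliar with how a VAD stage can yield "No active speech found", here is a simplified illustration of onset/offset (hysteresis) thresholding over per-frame speech probabilities. It is a sketch, not the pyannote or whisperx implementation: the function name, the frame duration, and the default thresholds (0.5 and 0.363, believed to match whisperx's `vad_onset` and `vad_offset` defaults) are assumptions, and the thread does not establish why the model scores these files the way it does.

```python
from typing import List, Tuple


def binarize(
    scores: List[float],
    onset: float = 0.5,    # assumed default; a segment opens above this
    offset: float = 0.363,  # assumed default; a segment closes below this
    frame_s: float = 0.02,  # hypothetical frame duration in seconds
) -> List[Tuple[float, float]]:
    """Turn per-frame speech probabilities into (start, end) segments."""
    segments = []
    start = None
    for i, score in enumerate(scores):
        if start is None and score > onset:
            start = i * frame_s  # speech begins
        elif start is not None and score < offset:
            segments.append((start, i * frame_s))  # speech ends
            start = None
    if start is not None:
        # Close a segment that is still open at the end of the audio.
        segments.append((start, len(scores) * frame_s))
    return segments
```

If no frame ever exceeds `onset`, the list is empty and nothing is passed on to the ASR model, which is consistent with the empty `segments` result reported above. Lowering the onset threshold is something to experiment with, not a confirmed fix.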