m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

How to enable VAD feature in python usage #170

Open johnchienbronci opened 1 year ago

johnchienbronci commented 1 year ago

The sample code doesn't seem to use VAD — is that correct? If so, how can I enable it?

BarfingLemurs commented 1 year ago

--vad_filter True right?

johnchienbronci commented 1 year ago

"--vad_filter" can only be used in the CLI. I want to use WhisperX with VAD enabled through Python, not through CLI operations.

m-bain commented 1 year ago

I will add this to the documentation, but it's approximately as follows (assuming your audio file is in .wav format):

import gc

import torch
from whisper import load_model
from whisperx import load_align_model, load_vad_model, transcribe_with_vad, align

device = "cuda"
audio_path = "/path/to/your/audio.wav"
model_name = "large-v2"               # Whisper checkpoint to use
vad_onset, vad_offset = 0.500, 0.363  # VAD activation/deactivation thresholds
temperature = 0.0

# Load the VAD model and Whisper, then transcribe with VAD-based chunking
vad_model = load_vad_model(torch.device(device), vad_onset, vad_offset)
model = load_model(model_name, device=device)
result = transcribe_with_vad(model, audio_path, vad_model, temperature=temperature)

# Unload Whisper and the VAD model to free GPU memory before alignment
del model
del vad_model
gc.collect()
torch.cuda.empty_cache()

# Load the alignment model for the detected language and align word-level timestamps
align_language = result.get("language", "en")
align_model, align_metadata = load_align_model(align_language, device)
result_aligned = align(result["segments"], align_model, align_metadata, audio_path, device)

johnchienbronci commented 1 year ago

Thank you for your reply. Does the audio parameter support a waveform (rather than audio_path)?
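For context on what such a waveform would look like: Whisper operates on 16 kHz mono float32 arrays, which is exactly what `whisper.load_audio(audio_path)` returns. A minimal sketch of that array format (the synthetic sine tone stands in for real audio, and whether `transcribe_with_vad` accepts an array in place of a path is an assumption, not confirmed by this thread):

```python
import numpy as np

SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio

# Stand-in for a real recording: 2 seconds of a 440 Hz tone.
t = np.linspace(0.0, 2.0, 2 * SAMPLE_RATE, endpoint=False)
waveform = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# This matches the dtype/shape whisper.load_audio(audio_path) produces,
# so an array like this is what you would pass if the API accepts waveforms:
# result = transcribe_with_vad(model, waveform, vad_model, temperature=0.0)
print(waveform.dtype, waveform.shape)
```

If it only accepts paths, the fallback is to write the array to a temporary .wav file first.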