m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Incomplete transcription of Non-English audios #764

Open aayushNB opened 3 months ago

aayushNB commented 3 months ago

I am running WhisperX with large-v3 model.

When an audio file is given, the transcription ignores the last 7-8 seconds of the audio, so the output transcript is shorter than the expected one.

I am using the code below for inference:

import whisperx

device = "cuda"
batch_size = 16
compute_type = "float32"

model = whisperx.load_model("large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio("audio.wav")  # path is illustrative
result = model.transcribe(audio, batch_size=batch_size)

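To verify that the tail of the audio is actually being dropped, one simple check (a sketch, not part of WhisperX) is to compare the audio duration with the end timestamp of the last segment in `result["segments"]`. WhisperX's `load_audio` resamples to 16 kHz, so the duration is `len(audio) / 16000`:

```python
def tail_gap(num_samples, segments, sample_rate=16000):
    """Seconds of audio at the end of the clip not covered by any
    transcribed segment. Each segment is a dict with an "end" key
    (seconds), as returned by WhisperX's transcribe()."""
    duration = num_samples / sample_rate
    last_end = max(seg["end"] for seg in segments)
    return duration - last_end

# e.g. a 60 s clip whose last segment ends at 52 s -> 8 s untranscribed:
print(tail_gap(60 * 16000, [{"end": 30.5}, {"end": 52.0}]))  # 8.0
```

A gap of several seconds on speech-containing audio, as reported here, points at the segmentation step rather than the ASR model itself.
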
SmartManoj commented 3 months ago

Works fine. Could you attach your audio file? audio.zip Got output: மற்ற படங்களைப் பற்றிய செய்திகள் எதுவும் வெளிவரவில்லை. (Tamil: "No news came out about the other films.")

aayushNB commented 3 months ago

chunk.zip

My output (Hindi; note the transcript cuts off mid-word): ' कोई ना कोई आकर एंट पर ले ले लेता था उनकी प्रॉपर्टी को ठीक है तो इन्वेस्टमेंट पर्पर्स के साथ से भी प्रॉपर्टी की थोड़ी सी अंदर है वह मैं भी समझ रहा हूं मैं बिजनेस पर्पर्स के साथ से आपको डिस्प्ले तो बाहर होने से ही आपकी ब'

For your chunk, I obtained the same transcript, but that might be because of its shorter duration. With my own audios of shorter length (~5 sec), I also receive correct results.

Can you please transcribe the above chunk and compare the results?

NegatedObjectIdentity commented 3 months ago

I had a similar issue with German audio where parts of sentences were missing. It turned out that the issue is the VAD (voice activity detection) model of whisperX. It has hard-coded values of 'vad_onset': 0.500 and 'vad_offset': 0.363. For me it worked once I changed these values to 'vad_onset': 0.1 and 'vad_offset': 0.1.

vad_opts = {"vad_onset": 0.1, "vad_offset": 0.1}
model = whisperx.load_model("large-v3", device, compute_type=compute_type, vad_options=vad_opts)
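For intuition on why this helps: the onset/offset values act as hysteresis thresholds on the frame-level speech probabilities produced by the VAD model. A speech region opens only when the probability rises above `vad_onset` and closes when it falls below `vad_offset`, so quiet speech that never crosses a high onset threshold is dropped before Whisper ever sees it. A toy sketch of this binarization (plain Python illustration, not the actual pyannote/whisperX implementation):

```python
def binarize(probs, onset, offset):
    """Toy hysteresis binarization: return (start, end) frame-index
    pairs for regions judged as speech. A region opens when the
    probability exceeds `onset` and closes when it drops below `offset`."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None:
            if p > onset:          # region opens above the onset threshold
                start = i
        elif p < offset:           # region closes below the offset threshold
            segments.append((start, i))
            start = None
    if start is not None:          # speech continues to the end of the clip
        segments.append((start, len(probs)))
    return segments

# Quiet speech whose probability peaks at only 0.45:
probs = [0.05, 0.4, 0.45, 0.4, 0.05]
print(binarize(probs, onset=0.5, offset=0.363))  # [] -> segment dropped, never transcribed
print(binarize(probs, onset=0.1, offset=0.1))    # [(1, 4)] -> segment kept
```

Lowering both thresholds makes the VAD more permissive, at the cost of potentially passing more non-speech audio to the ASR model.
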