m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Incomplete transcription of Non-English audios #764

Open aayushNB opened 3 months ago

aayushNB commented 3 months ago

I am running WhisperX with large-v3 model.

When an audio file is given, the transcription ignores the last 7-8 seconds of the audio, so the output transcript is shorter than the expected one.

I am using the code below for inference:

import whisperx

device = "cuda"
batch_size = 16
compute_type = "float32"

model = whisperx.load_model("large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio("audio.wav")  # path is illustrative
result = model.transcribe(audio, batch_size=batch_size)

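To verify that the tail of the audio is actually being dropped, one simple check (a sketch, not part of WhisperX) is to compare the audio duration with the end timestamp of the last segment in `result["segments"]`. WhisperX's `load_audio` resamples to 16 kHz, so the duration is `len(audio) / 16000`:

```python
def tail_gap(num_samples, segments, sample_rate=16000):
    """Seconds of audio at the end of the clip not covered by any
    transcribed segment. Each segment is a dict with an "end" key
    (seconds), as returned by WhisperX's transcribe()."""
    duration = num_samples / sample_rate
    last_end = max(seg["end"] for seg in segments)
    return duration - last_end

# e.g. a 60 s clip whose last segment ends at 52 s -> 8 s untranscribed:
print(tail_gap(60 * 16000, [{"end": 30.5}, {"end": 52.0}]))  # 8.0
```

A gap of several seconds on speech-containing audio, as reported here, points at the segmentation step rather than the ASR model itself.
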
SmartManoj commented 3 months ago

Works fine. Could you attach your audio file? audio.zip Got output: மற்ற படங்களைப் பற்றிய செய்திகள் எதுவும் வெளிவரவில்லை. (Tamil: "No news came out about the other films.")

aayushNB commented 3 months ago

chunk.zip

My output (Hindi; note the transcript cuts off mid-word): ' कोई ना कोई आकर एंट पर ले ले लेता था उनकी प्रॉपर्टी को ठीक है तो इन्वेस्टमेंट पर्पर्स के साथ से भी प्रॉपर्टी की थोड़ी सी अंदर है वह मैं भी समझ रहा हूं मैं बिजनेस पर्पर्स के साथ से आपको डिस्प्ले तो बाहर होने से ही आपकी ब'

For your chunk, I obtained the same transcript, but that might be because of its shorter duration. With my own audios of shorter length (~5 sec), I also receive correct results.

Can you please transcribe the above chunk and compare the results?

NegatedObjectIdentity commented 3 months ago

I had a similar issue with German audio where parts of sentences were missing. It turned out that the issue is the VAD (voice activity detection) model of whisperX. It has hard-coded values of 'vad_onset': 0.500 and 'vad_offset': 0.363. For me it worked once I changed these values to 'vad_onset': 0.1 and 'vad_offset': 0.1.

vad_opts = {"vad_onset": 0.1, "vad_offset": 0.1}
model = whisperx.load_model("large-v3", device, compute_type=compute_type, vad_options=vad_opts)
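For intuition on why this helps: the onset/offset values act as hysteresis thresholds on the frame-level speech probabilities produced by the VAD model. A speech region opens only when the probability rises above `vad_onset` and closes when it falls below `vad_offset`, so quiet speech that never crosses a high onset threshold is dropped before Whisper ever sees it. A toy sketch of this binarization (plain Python illustration, not the actual pyannote/whisperX implementation):

```python
def binarize(probs, onset, offset):
    """Toy hysteresis binarization: return (start, end) frame-index
    pairs for regions judged as speech. A region opens when the
    probability exceeds `onset` and closes when it drops below `offset`."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None:
            if p > onset:          # region opens above the onset threshold
                start = i
        elif p < offset:           # region closes below the offset threshold
            segments.append((start, i))
            start = None
    if start is not None:          # speech continues to the end of the clip
        segments.append((start, len(probs)))
    return segments

# Quiet speech whose probability peaks at only 0.45:
probs = [0.05, 0.4, 0.45, 0.4, 0.05]
print(binarize(probs, onset=0.5, offset=0.363))  # [] -> segment dropped, never transcribed
print(binarize(probs, onset=0.1, offset=0.1))    # [(1, 4)] -> segment kept
```

Lowering both thresholds makes the VAD more permissive, at the cost of potentially passing more non-speech audio to the ASR model.
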