m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Any ways to reduce or calibrate the offset of word timeline? #919

Open leinace1001 opened 2 weeks ago

leinace1001 commented 2 weeks ago

My research needs a precise match between words and speech. However, the word timeline generated by whisperX seems to have a large offset relative to the audio file; sometimes an entire word is even excluded. How can I solve this? My audio files are typically 2 hours long, and I sometimes find that the offset is smaller on short audio. Does it suffer from cumulative error on long audio?
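
One way to check whether the error is cumulative is to hand-label the start times of a few words spread across the recording and fit a line to the offset as a function of time. The following is only a minimal sketch under that assumption: it does not call whisperX, the names `fit_offset_drift` and `calibrate` are hypothetical helpers, and the numbers in the demo are made up for illustration. A slope near zero would suggest a constant offset; a clearly non-zero slope would suggest drift that grows with audio length.

```python
from statistics import mean


def fit_offset_drift(predicted, reference):
    """Least-squares fit of: offset = intercept + slope * reference_time.

    predicted: word start times (seconds) reported by the aligner
    reference: hand-labeled start times (seconds) for the same words
    Returns (intercept, slope); slope is seconds of drift per second of audio.
    """
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need at least two matched (predicted, reference) pairs")
    offsets = [p - r for p, r in zip(predicted, reference)]
    t_mean = mean(reference)
    o_mean = mean(offsets)
    var = sum((t - t_mean) ** 2 for t in reference)
    if var == 0:
        raise ValueError("reference times must not all be equal")
    cov = sum((t - t_mean) * (o - o_mean) for t, o in zip(reference, offsets))
    slope = cov / var
    intercept = o_mean - slope * t_mean
    return intercept, slope


def calibrate(t, intercept, slope):
    """Map a predicted time back to the reference timeline.

    Inverts: predicted = reference + intercept + slope * reference.
    """
    return (t - intercept) / (1.0 + slope)


if __name__ == "__main__":
    # Hypothetical hand-labeled checkpoints across a 2-hour file (seconds).
    reference = [10.0, 1800.0, 3600.0, 7200.0]
    # Hypothetical aligner output for the same words.
    predicted = [10.2, 1801.1, 3602.0, 7203.8]
    intercept, slope = fit_offset_drift(predicted, reference)
    print(f"constant offset: {intercept:.3f} s, drift: {slope * 3600:.3f} s per hour")
    print([round(calibrate(p, intercept, slope), 2) for p in predicted])
```

If the fit shows drift, applying `calibrate` to every word timestamp is a crude correction; it assumes the drift is linear over the whole file, which may not hold.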