My research needs a precise match between words and speech, but the word-level timeline generated by whisperX has a large offset against the audio file; sometimes an entire word is even missing. How can I solve this? My audio files are typically about 2 hours long. I have noticed that the offset is smaller on short audio. Does the alignment suffer from accumulated error on long audio?