m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Aligned timestamp for the last phoneme is too long for Dutch language #749

Open ShoAkamine opened 5 months ago

ShoAkamine commented 5 months ago

I noticed that the "end" timestamp for the last phoneme is often off for the Dutch language, leading to the duration of the last phoneme being way longer than the actual utterance.

As you can see in the screenshot below, the "end" timestamp for the character "t", which is the last phoneme of the segment, is estimated to be "117.112", which makes the duration of the phoneme about 5 seconds long. The actual end timestamp was somewhere around "112.400".

image

Here's another example. In this case, the last phoneme "a" is estimated to be about 4 seconds long. The actual end timestamp was "39.410".

image

Could anyone assist me with how to solve this issue? Thank you!

elimisteve commented 3 months ago

Hi @ShoAkamine, would you mind telling us how well Dutch language transcription is working for you in general? Thanks!

ShoAkamine commented 3 months ago

Hi @elimisteve, I don't have an objective measure of how well the transcription is working for Dutch, but the accuracy of transcription is very good! I would say 90+% of the time, the transcript doesn't need to be fixed (although if you are interested in conversational fillers such as "uhm", you need to add them yourself).

However, as I raised in this thread, the timestamp for the end of the segment is usually off. I'm now trying to see whether submitting the transcript generated by WhisperX to WebMAUS gives better timestamps than doing the forced alignment with WhisperX.

elimisteve commented 3 months ago

@ShoAkamine 90%+ sounds great! Do you remember which model you used for your Dutch transcriptions? large-v2? v3? Thank you much :smile:

ShoAkamine commented 3 months ago

I used the large-v3 model. But as I posted earlier, the end timestamp for the last character is often off (I observed the same pattern for German and Japanese too). This seems to happen when the model cannot estimate the timestamp, in which case it uses the start timestamp of the next character. As a temporary workaround, I add x seconds (e.g., 0.05 seconds) if the last character's duration is longer than a threshold (let's say 1 second).
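A minimal sketch of that workaround, for anyone who wants to try it. It assumes each aligned character is a plain dict with "start" and "end" keys in seconds, and it reads "add x seconds" as "set the end to the start plus x seconds"; both are assumptions. The function name `clamp_last_unit` and its parameters are hypothetical, not part of WhisperX.

```python
from typing import Any, Dict, List


def clamp_last_unit(
    units: List[Dict[str, Any]],
    max_duration: float = 1.0,       # threshold from the comment above (1 second)
    fallback_duration: float = 0.05,  # the "x seconds" from the comment above
) -> List[Dict[str, Any]]:
    """Return a copy of `units` with an implausibly long final unit shortened.

    Assumption: each unit is a dict with "start" and "end" in seconds.
    Units without timestamps (e.g. spaces) are skipped when looking for
    the last timed unit.
    """
    fixed = [dict(u) for u in units]  # shallow copies, so the input is untouched
    for unit in reversed(fixed):
        if "start" in unit and "end" in unit:
            if unit["end"] - unit["start"] > max_duration:
                # Replace the end with start + x seconds, rounded to milliseconds.
                unit["end"] = round(unit["start"] + fallback_duration, 3)
            break  # only the last timed unit is ever adjusted
    return fixed


# Illustrative values only: the 117.112 end is from the first example in this
# thread, the start times are made up.
chars = [
    {"char": "a", "start": 112.30, "end": 112.35},
    {"char": "t", "start": 112.35, "end": 117.112},
]
fixed_chars = clamp_last_unit(chars)
print(fixed_chars[-1])
```

This only hides the symptom: the new end is a fixed guess, not a measured boundary, so the threshold and the fallback duration need tuning per dataset.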