Open ShoAkamine opened 8 months ago
Hi @ShoAkamine, would you mind telling us how well Dutch language transcription is working for you in general? Thanks!
Hi @elimisteve, I don't have an objective measure of how well the transcription is working for Dutch, but the accuracy is very good! I would say 90+% of the time the transcript doesn't need to be fixed (although if you are interested in conversational fillers such as "uhm", you need to add them yourself).
However, as I raised in this thread, the timestamp for the end of a segment is usually off. I'm now trying to see whether getting timestamps from WebMAUS, by submitting the transcript generated by WhisperX, works better than doing the forced alignment with WhisperX.
@ShoAkamine 90%+ sounds great! Do you remember which model you used for your Dutch transcriptions? large-v2? v3? Thank you much :smile:
I used the large-v3 model. But as I posted earlier, the end timestamp for the last character is often off (the same pattern shows up for German and Japanese too). This seems to happen when the model cannot estimate the timestamp, in which case it uses the start timestamp of the next character. As a temporary workaround, I add x seconds (e.g., 0.05 seconds) if the last character's duration is longer than a threshold (let's say 1 second).
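For anyone who wants to try the same workaround, here is a minimal sketch. It assumes the alignment output is a list of dicts with "start" and "end" keys, and it reads "add x seconds" as resetting the end to start + x. The function name, the "char" key in the usage note, and the default values are hypothetical, not part of WhisperX.

```python
def clamp_last_char(chars, max_duration=1.0, fallback=0.05):
    """Return a copy of `chars` in which an implausibly long final
    character is shortened to `fallback` seconds.

    Assumption: each item is a dict with float "start" and "end" keys.
    """
    if not chars:
        return []
    fixed = [dict(c) for c in chars]  # copy so the input is not mutated
    last = fixed[-1]
    if last["end"] - last["start"] > max_duration:
        # The end was probably borrowed from the next segment's start,
        # so replace it with a short fixed duration instead.
        last["end"] = round(last["start"] + fallback, 3)
    return fixed
```

Calling `clamp_last_char(segment_chars)` on a segment whose last character runs from 112.1 to 117.112 would move that end to about 112.15. The fixed 0.05 s is only a guess at the real duration, so this hides the symptom rather than fixing the alignment.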
I encountered the same issue using the large-v3 model with Japanese. For each か at the end of a sentence, the alignment model seems to set "end" to the "start" of the next word, causing multiple seconds to be allotted to this one syllable.
[{'end': 2.413, 'score': 0.965, 'start': 2.233, 'word': '去'},
{'end': 2.653, 'score': 1.0, 'start': 2.413, 'word': '年'},
{'end': 2.813, 'score': 0.998, 'start': 2.653, 'word': 'の'},
{'end': 3.113, 'score': 0.999, 'start': 2.813, 'word': '誕'},
{'end': 3.353, 'score': 1.0, 'start': 3.113, 'word': '生'},
{'end': 3.493, 'score': 0.917, 'start': 3.353, 'word': '日'},
{'end': 4.273, 'score': 1.0, 'start': 3.493, 'word': 'に'},
{'end': 4.574, 'score': 1.0, 'start': 4.273, 'word': '何'},
{'end': 4.774, 'score': 0.901, 'start': 4.574, 'word': 'を'},
{'end': 4.874, 'score': 1.0, 'start': 4.774, 'word': 'も'},
{'end': 4.994, 'score': 0.995, 'start': 4.874, 'word': 'ら'},
{'end': 5.094, 'score': 1.0, 'start': 4.994, 'word': 'い'},
{'end': 5.234, 'score': 1.0, 'start': 5.094, 'word': 'ま'},
{'end': 5.314, 'score': 1.0, 'start': 5.234, 'word': 'し'},
{'end': 5.474, 'score': 1.0, 'start': 5.314, 'word': 'た'},
{'end': 9.655, 'score': 1.0, 'start': 5.474, 'word': 'か'},
{'end': 9.795, 'score': 1.0, 'start': 9.655, 'word': '家'},
{'end': 10.095, 'score': 0.999, 'start': 9.795, 'word': '族'},
{'end': 10.295, 'score': 1.0, 'start': 10.095, 'word': 'の'},
{'end': 10.555, 'score': 1.0, 'start': 10.295, 'word': '誕'},
{'end': 10.755, 'score': 1.0, 'start': 10.555, 'word': '生'},
{'end': 10.915, 'score': 1.0, 'start': 10.755, 'word': '日'},
{'end': 11.735, 'score': 1.0, 'start': 10.915, 'word': 'に'},
{'end': 12.095, 'score': 0.968, 'start': 11.735, 'word': '何'},
{'end': 12.355, 'score': 0.927, 'start': 12.095, 'word': 'を'},
{'end': 12.495, 'score': 1.0, 'start': 12.355, 'word': 'あ'},
{'end': 12.635, 'score': 1.0, 'start': 12.495, 'word': 'げ'},
{'end': 12.755, 'score': 1.0, 'start': 12.635, 'word': 'ま'},
{'end': 12.855, 'score': 1.0, 'start': 12.755, 'word': 'し'},
{'end': 12.995, 'score': 1.0, 'start': 12.855, 'word': 'た'},
{'end': 16.856, 'score': 0.995, 'start': 12.995, 'word': 'か'},
{'end': 17.136, 'score': 1.0, 'start': 16.856, 'word': '友'},
{'end': 17.416, 'score': 1.0, 'start': 17.136, 'word': '達'},
{'end': 17.556, 'score': 1.0, 'start': 17.416, 'word': 'の'},
...
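To find the affected entries automatically, something like the following sketch might help. It assumes the word list has the same shape as the output above ("start", "end", "word" keys) and flags any item that is both longer than a threshold and ends exactly where the next item starts. The function name and the 1-second threshold are hypothetical choices, not anything WhisperX provides.

```python
def flag_stretched(words, max_duration=1.0):
    """Return the items whose "end" coincides with the next item's
    "start" and whose duration exceeds `max_duration` seconds."""
    flagged = []
    for cur, nxt in zip(words, words[1:]):
        touches_next = abs(cur["end"] - nxt["start"]) < 1e-6
        too_long = cur["end"] - cur["start"] > max_duration
        if touches_next and too_long:
            flagged.append(cur)
    return flagged


# Three consecutive entries copied from the output above.
sample = [
    {"end": 5.474, "score": 1.0, "start": 5.314, "word": "た"},
    {"end": 9.655, "score": 1.0, "start": 5.474, "word": "か"},
    {"end": 9.795, "score": 1.0, "start": 9.655, "word": "家"},
]
print([w["word"] for w in flag_stretched(sample)])  # only か is flagged
```

Because every entry in this output touches the next one, the duration check does the real work here; the threshold would need tuning for slow speech or long held vowels. The final item of a list is never flagged, since it has no following entry to compare against.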
I noticed that the "end" timestamp for the last phoneme is often off for the Dutch language, leading to the duration of the last phoneme being way longer than the actual utterance.
As you can see in the screenshot below, the "end" timestamp for the character "t", which is the last phoneme of the segment, is estimated to be "117.112", which makes the duration of the phoneme about 5 seconds long. The actual end timestamp was somewhere around "112.400".
Here's another example. In this case, the last phoneme "a" is estimated to be about 4 seconds long. The actual end timestamp was "39.410".
Could anyone assist me with how to solve this issue? Thank you!