m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Some of the words in word-segments don't have start or end times #253

Open NielsVandenEynde opened 1 year ago

NielsVandenEynde commented 1 year ago

This is something that changed with the latest version. Some short words such as numbers just don't have timestamps. This is annoying because I'm parsing those words into sentences myself.

MasterTemple commented 1 year ago

I have the same issue for all numeric values in a transcript. This includes punctuation that is stored with the value. Examples: "1", "2,", "3?", "4.", and "2,000".

xorlof commented 1 year ago

This is covered in the readme under the Limitations heading, "Transcript words which do not contain characters in the alignment models dictionary e.g. "2014." or "£13.60" cannot be aligned and therefore are not given a timing."
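For anyone who wants to see which words are affected in their own transcript, here is a minimal sketch. It assumes the aligned output is a list of dicts in which timed words carry `start` and `end` keys and unaligned words carry only `word`; the helper name `find_untimed_words` and the sample data are made up for illustration and are not part of WhisperX.

```python
def find_untimed_words(word_segments):
    """Return (index, word) pairs for entries missing a start or end time.

    Assumes each entry is a dict with a "word" key, and that unaligned
    words simply lack the "start" and/or "end" keys.
    """
    return [
        (i, w.get("word", ""))
        for i, w in enumerate(word_segments)
        if "start" not in w or "end" not in w
    ]


# Hypothetical sample data, shaped like the assumption above.
sample = [
    {"word": "In", "start": 0.10, "end": 0.22},
    {"word": "2014."},
    {"word": "it", "start": 0.90, "end": 1.00},
]
untimed = find_untimed_words(sample)
print(untimed)
```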

7k50 commented 1 year ago

Would it be possible to include an option for giving these words some sort of timestamp, even if approximate or "neighboring"? Like the OP, I am parsing individual words into sentences, and this is aided by standardization of the format.

m-bain commented 1 year ago

I see, yes it's tricky, because for some use cases the timestamps of non-aligned words can be interpolated from their neighbors. For other use cases, dropping them is more suitable - or merging them into neighboring words. I also wanted to keep a distinction between words that are alignable and words that aren't, rather than apply heuristics to the ones that aren't.

At the moment I have left the post-processing up to the user. Feel free to push some helper functions that do any of the above; it seems like there is a need for them.
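As a rough illustration of the interpolation option mentioned above, the sketch below spreads each run of untimed words evenly across the gap between the previous word's end and the next word's start. It is only one possible heuristic, not something WhisperX provides: the function names, the `interpolated` flag, and the assumed input shape (a list of dicts where unaligned words lack `start`/`end`) are all assumptions made for this example.

```python
import copy


def _is_timed(word):
    """True if the entry has both a start and an end time."""
    return "start" in word and "end" in word


def fill_missing_times(word_segments):
    """Return a copy in which untimed words get interpolated timestamps.

    Each run of consecutive untimed words is spread evenly over the gap
    between the previous timed word's end and the next timed word's start.
    Runs at the very start or end of the list collapse to zero duration
    at the nearest known time. Filled entries are marked with
    "interpolated": True so they stay distinguishable from aligned words.
    """
    words = copy.deepcopy(word_segments)
    n = len(words)
    i = 0
    while i < n:
        if _is_timed(words[i]):
            i += 1
            continue
        # Find the end of this run of untimed words: words[i:j].
        j = i
        while j < n and not _is_timed(words[j]):
            j += 1
        prev_end = words[i - 1]["end"] if i > 0 else None
        next_start = words[j]["start"] if j < n else None
        if prev_end is None and next_start is None:
            break  # no timed word anywhere, nothing to anchor on
        if prev_end is None:
            prev_end = next_start
        if next_start is None:
            next_start = prev_end
        # Guard against overlapping neighbors producing negative durations.
        next_start = max(next_start, prev_end)
        step = (next_start - prev_end) / (j - i)
        for k in range(i, j):
            words[k]["start"] = prev_end + (k - i) * step
            words[k]["end"] = prev_end + (k - i + 1) * step
            words[k]["interpolated"] = True
        i = j
    return words
```

Dropping the untimed words instead is a one-line filter such as `[w for w in words if "start" in w and "end" in w]`. Which of the two is appropriate depends on the use case, as noted above.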

7k50 commented 9 months ago

Is there any hope of getting a solution in place that can highlight numeric words (e.g. via tags in .srt files, as is done for non-numeric words)?

I have spent a large amount of time trying to write a Python program to post-process .srt files as described above, but the complexities involved in sentence and word management are beyond my ability. Wouldn't it make more sense for everyone if WhisperX itself could approximate the closest timecode for numeric words?