Why do the numbers in the ASR results not have a start and end timestamp?

m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

BSD 2-Clause "Simplified" License


Why do the numbers in the ASR results not have a start and end timestamp? #911

Open hpjang opened 2 days ago

hpjang commented 2 days ago

As you can see, 1462 doesn't have start and end timestamps.

rkulyassa commented 1 day ago

See #314, #717, #789, #792, ...

Transcript words that contain no characters from the alignment model's dictionary, e.g. "2014." or "£13.60", cannot be aligned and are therefore not given a timing.

The solution is to pass --suppress_numerals or suppress_numerals=True.

randyburden commented 1 day ago

As an alternative: if you want to keep the numerals rather than suppress them (so that a numeral such as 7 is not converted into "seven"), you can run post-processing that looks at the timestamps of the words before and after the numeral and uses them to fill in the numeral's missing start and end. This is the strategy we use, and it works very well.
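
A minimal sketch of that kind of post-processing, not the exact code anyone in this thread runs. It assumes each word is a dict with a "word" key and that unaligned words simply lack "start" and "end"; the function name `fill_missing_timestamps` is hypothetical. A run of untimed words is spread evenly over the gap between the previous word's end and the next word's start. Words at the very beginning or end of the list have only one neighbor and are left untouched.

```python
from typing import Any


def _is_timed(word: dict[str, Any]) -> bool:
    """A word counts as timed only if it carries both keys."""
    return "start" in word and "end" in word


def fill_missing_timestamps(words: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of `words` with gaps filled from neighboring words.

    Hypothetical helper: assumes untimed words lack "start"/"end" entirely.
    """
    filled = [dict(w) for w in words]  # shallow copies, input is not mutated
    n = len(filled)
    i = 0
    while i < n:
        if _is_timed(filled[i]):
            i += 1
            continue
        # Find the end of this run of untimed words: indices [i, j).
        j = i
        while j < n and not _is_timed(filled[j]):
            j += 1
        has_prev = i > 0
        has_next = j < n
        if has_prev and has_next:
            gap_start = filled[i - 1]["end"]
            gap_end = filled[j]["start"]
            step = (gap_end - gap_start) / (j - i)
            for k in range(i, j):
                filled[k]["start"] = gap_start + (k - i) * step
                filled[k]["end"] = gap_start + (k - i + 1) * step
        # Otherwise the run touches the edge of the list; leave it untimed.
        i = j
    return filled


words = [
    {"word": "in", "start": 1.0, "end": 1.2},
    {"word": "1462"},
    {"word": "the", "start": 2.0, "end": 2.1},
]
result = fill_missing_timestamps(words)
```

Here "1462" receives the whole gap between "in" and "the". That interval also covers any silence around the numeral, so treat it as an upper bound on where the word sits, not a precise alignment.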

rkulyassa commented 1 day ago

@randyburden That was actually my naive approach as well. What you describe may become problematic, though, if the numeral is at the beginning or end of a sentence and you want to partition the audio there. Then you enter magic-number territory, having to determine offsets and so on. I guess it ultimately depends on your use case.
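
One way to sidestep hand-picked offsets for the edge case, sketched under assumptions: if each segment dict carries its own "start" and "end" alongside its "words" list, those bounds can stand in for the missing neighbor. The function names below are hypothetical, and the approach only helps when the segment bounds actually extend past the first and last aligned words. If they coincide, the numeral ends up with zero duration and you are back to choosing an offset.

```python
from typing import Any


def _spread(words: list[dict[str, Any]], lo: int, hi: int, t0: float, t1: float) -> None:
    """Spread the words at indices [lo, hi) evenly over the interval [t0, t1]."""
    count = hi - lo
    if count <= 0:
        return
    step = (t1 - t0) / count
    for k in range(count):
        words[lo + k]["start"] = t0 + k * step
        words[lo + k]["end"] = t0 + (k + 1) * step


def fill_edge_timestamps(segment: dict[str, Any]) -> list[dict[str, Any]]:
    """Time leading/trailing untimed words using the segment's own bounds.

    Hypothetical helper: assumes segment has "start", "end", and "words",
    and that untimed words lack "start"/"end" entirely.
    """
    words = [dict(w) for w in segment["words"]]  # copies, input is not mutated
    timed = [i for i, w in enumerate(words) if "start" in w and "end" in w]
    if not timed:
        return words  # nothing to anchor against
    first, last = timed[0], timed[-1]
    # Leading run: from the segment start up to the first timed word.
    _spread(words, 0, first, segment["start"], words[first]["start"])
    # Trailing run: from the last timed word up to the segment end.
    _spread(words, last + 1, len(words), words[last]["end"], segment["end"])
    return words


segment = {
    "start": 0.0,
    "end": 3.0,
    "words": [
        {"word": "1462"},
        {"word": "was", "start": 1.0, "end": 1.5},
        {"word": "2014."},
    ],
}
edge_filled = fill_edge_timestamps(segment)
```

Whether this is good enough for partitioning audio depends on how tight the segment bounds are, which is the same use-case question raised above.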