m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

timestamps for number #314

Open Atefeh197 opened 1 year ago

Atefeh197 commented 1 year ago

Hi everyone

I just started using whisperX. Its timestamps are more accurate than whisper's, but they are very inaccurate for numbers.

For example, the text of the first row is "Good day." and whisperX extracts accurate timestamps for each word. The second row is "780 802": its start and end times are very close to each other, and there are no separate timestamps for "780" and "802".

{'start': 2.861, 'end': 4.081, 'text': 'Good day.', 'words': [{'word': 'Good', 'start': 2.861, 'end': 3.281, 'score': 0.587}, {'word': 'day.', 'start': 3.301, 'end': 4.001, 'score': 0.48}]},

{'start': 20.025, 'end': 20.045, 'text': '780 802', 'words': [{'word': '780'}, {'word': '802'}]},

How can I get better timestamps for numbers?

m-bain commented 1 year ago

Yes, this is a limitation of whisperX: it is unable to provide word timestamps for numerals. You can avoid this by transcribing with the --suppress_numerals flag, which transcribes numbers as words, e.g. "780" -> "seven hundred and eighty". You could then use a text normalizer to convert the words back to numerals.

sorgfresser commented 1 year ago

Regarding the normalization afterwards, there are libraries like text2num but they don't support many languages. Maybe yours is supported.

snoop2head commented 7 months ago

@robvanson I think the issue is related to #717

snoop2head commented 7 months ago

@m-bain How can I pass suppress_numerals=True in the Python interface?

snoop2head commented 7 months ago

I found the method in issue #629 and in the code linked below. https://github.com/m-bain/whisperX/blob/78dcfaab51005aa703ee21375f81ed31bc248560/whisperx/asr.py#L259-L332

Thanks!
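For anyone landing here with the same question, a minimal sketch of what that looks like from Python. It assumes `whisperx.load_model` accepts an `asr_options` dict with a `suppress_numerals` key, as in the linked asr.py; the function name, model name and device are only placeholders, so check this against the version you have installed.

```python
# Assumption: load_model() takes an asr_options dict containing "suppress_numerals",
# as in the asr.py revision linked above. Verify against your installed version.
ASR_OPTIONS = {"suppress_numerals": True}


def transcribe_spelled_out(audio_path, model_name="large-v2", device="cuda"):
    """Transcribe with numerals suppressed, then align to get word timestamps."""
    import whisperx  # imported lazily so the module loads without whisperx installed

    model = whisperx.load_model(model_name, device, asr_options=ASR_OPTIONS)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio)

    # Numbers now arrive as words ("seven hundred and eighty"), which the
    # alignment step can timestamp like any other word.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(result["segments"], align_model, metadata, audio, device)
```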

robvanson commented 7 months ago

@snoop2head "--suppress_numerals" works for me, thanks a lot.

villesau commented 2 weeks ago

Would it be possible to implement something that converts numerals to text where needed and reverts them to numerals once the timestamps are set? That way we would get the original numeric value with timestamps in place.
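Until something like this exists in whisperX itself, the reverting step can be done as post-processing. Below is a rough sketch, not part of whisperX: it takes the aligned word list produced with --suppress_numerals, merges each run of English number words into one numeral, and gives it the start of the first word and the end of the last. All names here (`merge_number_words` and the helpers) are made up for illustration, it handles English only, and it assumes every word entry has `start` and `end`.

```python
# English-only lookup tables for spelled-out numbers.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}
PUNCT = ".,!?;:"


def tokens_of(word):
    """Lowercase, strip surrounding punctuation, split hyphens ("eighty-two")."""
    return word.lower().strip(PUNCT).split("-")


def is_number_word(word):
    return all(t in UNITS or t in TENS or t in SCALES or t == "hundred"
               for t in tokens_of(word))


def to_int(tokens):
    """Convert number tokens to an int; non-number tokens such as "and" are skipped."""
    total = current = 0
    for t in tokens:
        if t in UNITS:
            current += UNITS[t]
        elif t in TENS:
            current += TENS[t]
        elif t == "hundred":
            current = max(current, 1) * 100
        elif t in SCALES:
            total += max(current, 1) * SCALES[t]
            current = 0
    return total + current


def merge_number_words(words):
    """Replace each run of number words with one numeral spanning the run's timestamps."""
    merged, i = [], 0
    while i < len(words):
        if not is_number_word(words[i]["word"]):
            merged.append(words[i])
            i += 1
            continue
        j = i + 1
        while j < len(words):
            if words[j - 1]["word"].endswith(tuple(PUNCT)):
                break  # punctuation ends the current number
            w = words[j]["word"]
            if is_number_word(w):
                j += 1
            elif (w.lower() == "and" and j + 1 < len(words)
                  and is_number_word(words[j + 1]["word"])):
                j += 2  # "hundred and eighty": keep "and" inside the run
            else:
                break
        run = words[i:j]
        tokens = [t for w in run for t in tokens_of(w["word"])]
        last = run[-1]["word"]
        trailing = last[len(last.rstrip(PUNCT)):]  # keep e.g. a final "."
        merged.append({"word": str(to_int(tokens)) + trailing,
                       "start": run[0]["start"], "end": run[-1]["end"]})
        i = j
    return merged


demo = [
    {"word": "Call", "start": 1.0, "end": 1.2},
    {"word": "seven", "start": 1.3, "end": 1.5},
    {"word": "hundred", "start": 1.5, "end": 1.8},
    {"word": "and", "start": 1.8, "end": 1.9},
    {"word": "eighty", "start": 1.9, "end": 2.2},
    {"word": "now.", "start": 2.3, "end": 2.6},
]
print([w["word"] for w in merge_number_words(demo)])  # ['Call', '780', 'now.']
```

Two caveats: back-to-back numbers with no punctuation between them (like "780 802" in the original post) are ambiguous once spelled out and would be merged into a single value, and ordinary uses of words like "one" ("no one") get converted too. A real solution would need the model's original numeral grouping or a proper normalizer.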