Word-level output is combined for languages that don't use spaces #34

linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence

GNU Affero General Public License v3.0


Japanese is a good example. Here is the output for a single "word":

{"text"=>"いきますニュースタブでのサイトメイク表記が実際と違う", "start"=>0.02, "end"=>4.18, "confidence"=>0.719}

Many words are combined into one entry. Here is an example audio file to test with:

https://user-images.githubusercontent.com/3966239/219478733-ad14e548-8895-4995-9f81-02b761293a61.mp4

Update: We are noticing that language_detection does not occur properly inside _transcribe_timestamped_efficient(), but does work well with _transcribe_timestamped_naive(). Based on logging inside should_use_space(), switching to naive seems to fix the issue: when using efficient, the language is detected as en, and the incorrect spacing variable is subsequently used. Could you explain the difference between the two (efficient/naive)?
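To illustrate why a wrong language code leads to merged words, here is a minimal sketch. It is not the library's code: the names NO_SPACE_LANGUAGES, uses_spaces and group_tokens_into_words, as well as the language list and the token split, are assumptions made up for this example. It only shows the general idea that, when a language is assumed to use spaces, tokens without a leading space get glued onto the previous word.

```python
# Hypothetical list of languages written without spaces (an assumption,
# not the actual list used by whisper-timestamped).
NO_SPACE_LANGUAGES = {"ja", "zh", "th"}


def uses_spaces(language: str) -> bool:
    """Return True if words in this language are assumed to be space-separated."""
    return language not in NO_SPACE_LANGUAGES


def group_tokens_into_words(tokens, language):
    """Group decoded tokens into words, depending on the assumed language."""
    if not uses_spaces(language):
        # No spaces to rely on: treat every non-empty token as its own unit.
        return [t for t in tokens if t]
    words = []
    for token in tokens:
        if token.startswith(" ") or not words:
            # A leading space (or the very first token) starts a new word.
            words.append(token.strip())
        else:
            # Otherwise the token continues the previous word.
            words[-1] += token
    return words


# Made-up token split of the start of the sentence from the report.
tokens = ["いき", "ます", "ニュース", "タブ"]
print(group_tokens_into_words(tokens, "en"))  # ['いきますニュースタブ']
print(group_tokens_into_words(tokens, "ja"))  # ['いき', 'ます', 'ニュース', 'タブ']
```

With the language wrongly set to en, the Japanese tokens never start with a space, so they all collapse into a single "word", which matches the symptom described above.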

Thank you @kamranjon for opening this issue. Indeed there was a bug with "efficient decoding" when the language was detected automatically. This is fixed now.

I had not been testing thoroughly with languages like Japanese. I have now added tests to avoid such problems in the future.

About the difference between efficient and naive:

With "efficient", the word timestamps are predicted on the fly, while the Whisper model is decoding.
With "naive", we let the Whisper model decode everything first, then we go over all the detected segments and perform the alignment for each one (running Whisper model inference again, which is suboptimal).
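The difference between the two modes can be sketched with a toy model that only counts inference passes. This is a heavily simplified illustration, not the actual implementation: ToyModel, decode_segment, align, transcribe_efficient and transcribe_naive are hypothetical names, and real decoding involves many forward passes per segment rather than one.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Stand-in for a speech model; it counts passes instead of running one."""
    forward_passes: int = 0

    def decode_segment(self, audio_chunk: str) -> dict:
        self.forward_passes += 1
        # Pretend that one pass yields both the text and the alignment data.
        return {"text": audio_chunk.upper(), "attention": len(audio_chunk)}


def align(segment: dict) -> dict:
    """Placeholder for the word-level alignment step."""
    return {"text": segment["text"], "aligned": True}


def transcribe_efficient(model: ToyModel, chunks: list) -> list:
    # Alignment is done right after each segment is decoded, reusing that pass.
    return [align(model.decode_segment(chunk)) for chunk in chunks]


def transcribe_naive(model: ToyModel, chunks: list) -> list:
    # First, decode everything.
    for chunk in chunks:
        model.decode_segment(chunk)
    # Then go over each segment again and run inference a second time to align it.
    return [align(model.decode_segment(chunk)) for chunk in chunks]


chunks = ["seg one", "seg two", "seg three"]
efficient_model, naive_model = ToyModel(), ToyModel()
transcribe_efficient(efficient_model, chunks)
transcribe_naive(naive_model, chunks)
print(efficient_model.forward_passes, naive_model.forward_passes)  # 3 6
```

In this toy setting both modes return the same result, but the naive one runs the model twice per segment, which is the extra cost described above.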

The implementation of the efficient mode is much trickier, and so more prone to bugs (but I would say that it is quite stable now, and I hope that you found the last remaining issue).
