linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Word-level output is combined for languages that don't use spaces #34

Closed kamranjon closed 1 year ago

kamranjon commented 1 year ago

Japanese is a good example. Here is the output for a single "word":

{"text"=>"いきますニュースタブでのサイトメイク表記が実際と違う", "start"=>0.02, "end"=>4.18, "confidence"=>0.719}

Many words are merged into a single entry. Here is an example audio file to test with:

https://user-images.githubusercontent.com/3966239/219478733-ad14e548-8895-4995-9f81-02b761293a61.mp4
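For anyone who wants to check their own output for the same symptom, here is a minimal sketch in Python. It assumes the result is a dict with a "segments" list whose items each carry a "words" list of entries shaped like the one above ("text", "start", "end", "confidence"). The helper name `find_merged_words` and both thresholds are hypothetical choices for illustration and are not part of whisper-timestamped.

```python
def find_merged_words(result, max_chars=8, max_duration=2.0):
    """Return word entries that look like several words merged into one.

    max_chars and max_duration are arbitrary illustrative thresholds,
    not values taken from whisper-timestamped.
    """
    suspicious = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            duration = word["end"] - word["start"]
            # Flag entries that are unusually long in characters or in time.
            if len(word["text"]) > max_chars or duration > max_duration:
                suspicious.append(word)
    return suspicious


# The entry reported above, wrapped in the assumed result structure.
sample_result = {
    "segments": [
        {
            "words": [
                {
                    "text": "いきますニュースタブでのサイトメイク表記が実際と違う",
                    "start": 0.02,
                    "end": 4.18,
                    "confidence": 0.719,
                }
            ]
        }
    ]
}

flagged = find_merged_words(sample_result)
for word in flagged:
    print(f'{word["start"]:.2f}-{word["end"]:.2f}: {word["text"]}')
```

The thresholds will need tuning per language. A character limit that suits Japanese would be far too strict for languages with long compound words.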

Jeronymous commented 1 year ago

Thank you @kamranjon for opening this issue. There was indeed a bug with "efficient decoding" when the language was detected automatically. It is fixed now.

I had not been testing thoroughly with languages like Japanese. I have now added tests to avoid such problems in the future.

About the difference between the efficient and naive modes:

The implementation of the efficient mode is much trickier, and therefore more prone to bugs (but I would say it's quite stable now, and I hope you found the last remaining issue).