occ-ai / obs-localvocal

OBS plugin for local speech recognition and captioning using AI
https://obsproject.com/forum/resources/localvocal-live-stream-ai-assistant.1769/
GNU General Public License v2.0

Broken transcribing for russian language #59

Closed: takezie closed this issue 1 month ago

takezie commented 7 months ago

After voice input: "Это проверка русского языка для... " I got transcribed text: "Это пѐогеѐка ѐууукого џзыка длџ..."

It looks like Cyrillic "я" is replaced with "џ", "р" with "ѐ", "н" with "о", "с" with "у", and so on.
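To make the substitution pattern above easier to report, here is a minimal Python sketch that compares the spoken phrase with the transcribed one position by position. The helper name `substitutions` is hypothetical and not part of the plugin; the sketch assumes both strings have the same length, which holds for this sample but not for transcriptions that add or drop letters.

```python
import unicodedata


def substitutions(expected: str, got: str) -> dict[str, str]:
    """Map each expected character to the character that replaced it.

    Hypothetical helper for illustration only. If one expected letter is
    replaced by several different letters, only the last one is kept.
    """
    # NFC-normalize so letters such as "ѐ" compare as single code points.
    expected = unicodedata.normalize("NFC", expected)
    got = unicodedata.normalize("NFC", got)
    if len(expected) != len(got):
        raise ValueError("a positional diff needs strings of equal length")
    return {e: g for e, g in zip(expected, got) if e != g}


expected = "Это проверка русского языка для"
got = "Это пѐогеѐка ѐууукого џзыка длџ"

for e, g in substitutions(expected, got).items():
    print(f"{e} (U+{ord(e):04X}) -> {g} (U+{ord(g):04X})")
```

Printing the code points alongside the letters makes it possible to compare results between machines without relying on how each font renders the garbled characters.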

royshil commented 7 months ago

thanks for the issue report. are you on Windows?

takezie commented 7 months ago

Yes. Microsoft Windows [Version 10.0.22631.2715]

royshil commented 7 months ago

@takezie can you perhaps give me a recording of audio that produces this problem so i can test on my end? i know some Russian but not good enough to effectively debug

takezie commented 7 months ago

@royshil Here is a set of pangrams; every phrase contains the full set of Cyrillic characters.

mp3: google.drive

transcribing:

А ещё хорошо бы уметь всем на зависть чётко и наглядно писать буквы и цифры.

Аэрофотосъёмка ландшафта уже выявила земли богачей и процветающих крестьян.

Бегом марш! У месторождения кварцующихся фей без слёз хочется электрическую пыль.

Безмозглый широковещательный цифровой передатчик сужающихся экспонент.

Блеф разъедает ум, чаще цыгана живёшь беспокойно, юля — грех это!

В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!

Вопрос футбольных энциклопедий замещая чушью: эй, где съеден ёж?

Всё ускоряющаяся эволюция компьютерных технологий предъявила жёсткие требования к производителям как собственно вычислительной техники, так и периферийных устройств.

Вступив в бой с шипящими змеями — эфой и гадюкой, — маленький, цепкий, храбрый ёж съел их.

Государев указ: душегубцев да шваль всякую высечь, да калёным железом по щекам этих физиономий съездить!

Друг мой эльф! Яшке б свёз птиц южных чащ!

Завершён ежегодный съезд эрудированных школьников, мечтающих глубоко проникнуть в тайны физических явлений и химических реакций.

royshil commented 7 months ago

works for me... image

takezie commented 7 months ago

Tried on another PC, same problem. Screenshot_1

Very strange. OK, I'll try to build it on my PC; maybe something is wrong with the installed locales...

royshil commented 7 months ago

so this image you attached is wrong? it looks like the Cyrillic letters are showing up... are there specific letters that have problems?

takezie commented 7 months ago

Your variant also has wrong characters, but in a different way.

I tried building it on my PC and got the same problem. This is what I'm getting at the moment:

Original: А ещё хорошо бы уметь всем на зависть.

Transcribed: А еще хоѐошо было уметќ гуем оа загџзќ.

ѐ is \xD1\x90 but should be \xD1\x80 (р); ќ is \xD1\x9C but should be \xD1\x8C (ь); and so on...

It looks like the error is in the x90...x9A range, but then things get weirder. In гуем, у (\xD1\x83) should be с (\xD1\x81), but the у in уметќ has the same code and is transcribed correctly.

And in your sample, уметќ is broken (the correct spelling is уметь), and it decodes the same way for both of us, but хоѐошо is decoded wrong for me and correctly for you.

I don't understand how this is possible.
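The byte values quoted above can be double-checked with a short Python sketch. It encodes each expected and observed letter as UTF-8 and XORs the final bytes. The pair list is taken from the substitutions reported in this thread; the names `PAIRS` and `byte_diff` are hypothetical and chosen only for this illustration.

```python
# (expected, observed) pairs reported in the thread. Escapes are used so the
# exact code points are unambiguous regardless of font or normalization.
PAIRS = [
    ("\u0440", "\u0450"),  # р -> ѐ
    ("\u044c", "\u045c"),  # ь -> ќ
    ("\u044f", "\u045f"),  # я -> џ
    ("\u0441", "\u0443"),  # с -> у
    ("\u0432", "\u0433"),  # в -> г
    ("\u043d", "\u043e"),  # н -> о
]


def byte_diff(expected: str, observed: str) -> tuple[bytes, bytes, int]:
    """Return both UTF-8 encodings and the XOR of their last bytes."""
    e = expected.encode("utf-8")
    o = observed.encode("utf-8")
    return e, o, e[-1] ^ o[-1]


for exp, obs in PAIRS:
    e, o, xor = byte_diff(exp, obs)
    same_lead = e[0] == o[0]
    print(f"{exp} {e.hex(' ')} -> {obs} {o.hex(' ')}  "
          f"lead byte same: {same_lead}  xor: {xor:#04x}")
```

In every pair listed here the lead byte is unchanged and only the second byte differs, but the difference is not the same bit each time. That narrows the corruption to the second byte of each two-byte sequence; it does not by itself identify the cause.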

takezie commented 7 months ago

@royshil could you take a look at this PR, which probably solves the same problem? I'm not strong in C++, but maybe it will be useful.

github.com/ggerganov/whisper.cpp/pull/1313

royshil commented 7 months ago

@takezie yes, i've seen it. i have my own fix, which i think is more complete: https://github.com/occ-ai/obs-localvocal/blob/master/src/transcription-filter.cpp#L249 however, it looks like a bit more work is needed. it's not just Russian that's affected by this Whisper.cpp bug; i've had people report Polish, Greek, Chinese, Korean... so anything i do here needs to support all languages.

royshil commented 1 month ago

stale