This should be fixed with the latest release; you may have to clear the cache.
I installed 0.4.0, but it still does not work. I cleared the cache data, and also completely uninstalled the plugin and installed it again, but that did not help.
It also fails for other languages that use Cyrillic characters: Ukrainian, Belarusian, and others.
Here's another example of using the function "Extract Text into a new note".
I found that the characters in Cyrillic are correctly recognized if you select only one language "rus", but if you add a 2nd language "eng" the problem returns.
The problem (IMO) is that Tesseract sees your screenshot and wrongly assumes that it's English, even though Russian is also in the list. I'll add a contextual action to force a specific language when extracting.
I think I found the root of the problem and a fix for this and a similar issue with Chinese: https://github.com/scambier/obsidian-text-extractor/issues/30
After reading tesseract.js README and this syntax specifically:
I tried editing the data.json settings file in the Text Extractor plugin's main folder like this:
and it actually fixed the problem -- now OCR works for two languages at the same time:
I'm not fluent enough in JS, but my guess is that when the plugin passes the argument responsible for the language ("ocrLanguages") to the tesseract.js function, something goes wrong: instead of passing "eng+rus" or "eng+chi_tra", it throws an error and then uses "eng" as a fallback argument. I think so because if we manually set two non-Latin languages like this:
OCR still manages to read the Latin (English) text from my sample image above, while corrupting the Russian characters:
Please check this out.
Good find.
So what's happening is that when you configure multiple languages like ["eng", "rus"], they are concatenated to "eng+rus" before being passed to Tesseract. But there's a bug, right here:
The first line loads the languages "eng+rus" (the concatenated string), but the second line only initializes the first language from the array ["eng", "rus"]. Editing data.json with ["eng+rus"] instead of ["eng", "rus"] effectively bypasses this bug.
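To make the mismatch concrete, here is a minimal TypeScript sketch of the behavior described above. It is not the plugin's actual code, and the real patch is not shown in this thread. The `OcrWorkerLike` interface, `LangArgs`, and the three functions are hypothetical names invented for illustration; only the `loadLanguage` / `initialize` pairing and the "eng+rus" syntax are assumed from the tesseract.js README of that time.

```typescript
// Hypothetical stand-in for the part of a tesseract.js worker used here.
// Assumption: the worker exposes loadLanguage() and initialize(), both
// accepting a "+"-joined language string such as "eng+rus".
interface OcrWorkerLike {
  loadLanguage(langs: string): Promise<unknown>;
  initialize(langs: string): Promise<unknown>;
}

// Illustrative container for the two arguments (not from the plugin).
interface LangArgs {
  load: string; // value passed to loadLanguage()
  init: string; // value passed to initialize()
}

// Mirrors the bug as described: the joined string is loaded,
// but only the first array entry is initialized.
function buggyLangArgs(ocrLanguages: string[]): LangArgs {
  return { load: ocrLanguages.join("+"), init: ocrLanguages[0] ?? "" };
}

// One plausible fix: use the same joined string for both calls.
function fixedLangArgs(ocrLanguages: string[]): LangArgs {
  const joined = ocrLanguages.join("+");
  return { load: joined, init: joined };
}

// How the arguments would reach the worker.
async function setUpWorker(worker: OcrWorkerLike, args: LangArgs): Promise<void> {
  await worker.loadLanguage(args.load);
  await worker.initialize(args.init);
}

// ["eng", "rus"] with the bug: "eng+rus" is loaded, only "eng" is initialized.
console.log(buggyLangArgs(["eng", "rus"]));
// ["eng+rus"] with the bug: the single entry is used for both calls,
// which is why the data.json workaround helps.
console.log(buggyLangArgs(["eng+rus"]));
// ["eng", "rus"] with the sketched fix: "eng+rus" for both calls.
console.log(fixedLangArgs(["eng", "rus"]));
```

With a single array entry, "first element" and "joined string" are the same value, so the workaround sidesteps the mismatch without touching the code.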
@Onkitova @Fertion I've published a new version that should fix this. Can you confirm it's ok for you?
Yes, the recognition now works in several languages simultaneously. Thank you.
Yes. Thanks!
Problem description:
The search does not work in Russian, even though "rus" is added in the settings. When I search in English and Russian text appears in the search results, it is displayed in Latin characters instead of Cyrillic characters, as it should be.
For example, instead of the Russian word "Депозитарий", the text is recognized as "Jenosutapuu", i.e. the words are clearly not recognized in the correct language.
Your environment: