scambier / obsidian-text-extractor

A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.
GNU General Public License v3.0

[BUG] The search does not work in Russian. #1

Closed Fertion closed 1 year ago

Fertion commented 1 year ago

Problem description:

The search does not work in Russian. "rus" is added in the settings. However, when I search in English and Russian text shows up in the search results, it is displayed in Latin characters rather than in Cyrillic, as it should be.

For example, instead of the Russian word "Депозитарий", the text is recognized as "Jenosutapuu", i.e. the words are clearly not being recognized in the correct language.

image image

Your environment:

scambier commented 1 year ago

This should be fixed with the latest release, you may have to clear the cache.

Fertion commented 1 year ago

Installed 0.4.0, but it still does not work. I removed the cache data, then completely uninstalled the plugin and installed it again, but that did not help.

It also fails for all other languages that use Cyrillic characters: Ukrainian, Belarusian and others.

Here's another example, this time using the "Extract Text into a new note" function.

image

Fertion commented 1 year ago

I found that Cyrillic characters are recognized correctly if you select only one language, "rus", but if you add a second language, "eng", the problem returns.

scambier commented 1 year ago

The problem is (IMO) that Tesseract sees your screenshot and wrongly assumes that it's English, even though Russian is also in the list. I'll add a contextual action to force a specific language when extracting.

Onkitova commented 1 year ago

I think I found the root of the problem and a fix for this issue and a similar one with Chinese, https://github.com/scambier/obsidian-text-extractor/issues/30.

After reading the tesseract.js README, and this syntax specifically:

image

I tried editing the data.json settings file in the Text Extractor plugin's main folder like this:

image

and it actually fixed the problem. OCR now works for two languages at the same time:

sample image _english-russian_

I'm not fluent enough in JS, but my guess is that when the plugin passes the language argument ("ocrLanguages") to the tesseract.js function, something goes wrong: instead of passing "eng+rus" or "eng+chi_tra", it hits an error and then uses "eng" as a fallback. I think so because if we manually set two non-Latin languages like this:

image

OCR still manages to read the Latin-script English text from my sample image above (while corrupting the Russian characters):

image

Please check this out.

scambier commented 1 year ago

Good find.

So what's happening is that when you configure multiple languages like ["eng", "rus"], they are concatenated to "eng+rus" before being passed to Tesseract. But there's a bug, right here:

https://github.com/scambier/obsidian-text-extractor/blob/65d3a5ec5ef48cd083db5ba631ddb374cec153be/lib/src/ocr/ocr-manager.ts#L58-L59

The first line loads the languages "eng+rus" (the concatenated string), but the second line only initializes the first language from the array ["eng", "rus"]. Editing data.json with ["eng+rus"] instead of ["eng", "rus"] effectively bypasses this bug.
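The mismatch described above can be sketched roughly as follows. This is an illustration only, not the plugin's actual code: `OcrWorker`, `joinLanguages`, `initBuggy`, `initFixed` and `makeRecordingWorker` are hypothetical names, and the worker interface is a stand-in modeled on a load/initialize pair of calls, assumed here rather than taken from the linked file.

```typescript
// Hypothetical stand-in for an OCR worker with separate "load" and
// "initialize" steps. An assumption for illustration, not the real API surface.
interface OcrWorker {
  loadLanguage(langs: string): Promise<void>;
  initialize(langs: string): Promise<void>;
}

// Join the configured languages into the "eng+rus" form.
function joinLanguages(langs: string[]): string {
  return langs
    .map((l) => l.trim())
    .filter((l) => l.length > 0)
    .join("+");
}

// The pattern described in the comment above: every language is loaded,
// but only the first array entry is initialized.
async function initBuggy(worker: OcrWorker, langs: string[]): Promise<void> {
  await worker.loadLanguage(joinLanguages(langs));
  await worker.initialize(langs[0] ?? "");
}

// One possible fix: pass the same concatenated string to both steps.
async function initFixed(worker: OcrWorker, langs: string[]): Promise<void> {
  const joined = joinLanguages(langs);
  await worker.loadLanguage(joined);
  await worker.initialize(joined);
}

// Recording stub so the difference can be observed without running OCR.
function makeRecordingWorker(): { worker: OcrWorker; calls: string[] } {
  const calls: string[] = [];
  const worker: OcrWorker = {
    async loadLanguage(langs: string): Promise<void> {
      calls.push(`load:${langs}`);
    },
    async initialize(langs: string): Promise<void> {
      calls.push(`init:${langs}`);
    },
  };
  return { worker, calls };
}
```

With `["eng", "rus"]`, `initBuggy` records an initialize call for `eng` only, while `initFixed` records `eng+rus` for both steps. With the hand-edited `["eng+rus"]`, the first array entry is already the full string, which is presumably why the data.json workaround helped.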

scambier commented 1 year ago

@Onkitova @Fertion I've published a new version that should fix this. Can you confirm it's ok for you?

Fertion commented 1 year ago

Yes, the recognition now works in several languages simultaneously. Thank you.

Onkitova commented 1 year ago

> @Onkitova @Fertion I've published a new version that should fix this. Can you confirm it's ok for you?

Yes. Thanks!