microsoft / PowerToys

Windows system utilities to maximize productivity
MIT License

Using Text Extractor to extract Japanese Hiragana/Katakana contains spaces #22208

Closed kayoyoyak closed 1 year ago

kayoyoyak commented 1 year ago

Microsoft PowerToys version

0.64.0

Installation method

PowerToys auto-update

Running as admin

No

Area(s) with issue?

TextExtractor

Steps to reproduce

1. Press Win+Shift+T.
2. Capture a screen region containing Japanese text. Example (red framed area): (image)

The commit https://github.com/microsoft/PowerToys/commit/d17ac2bf790c75b837888ce4212e39511d0037b9 added support for CJK Kanji, but Japanese needs other character blocks to be supported as well. Character blocks to be supported:

3000 ~ 303F IsCJKSymbolsandPunctuation
3040 ~ 309F IsHiragana
30A0 ~ 30FF IsKatakana
4E00 ~ 9FFF IsCJKUnifiedIdeographs (supported by the above commit)
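
To see which of the four blocks listed above a given character falls into, a small check like the following can help. This is only an illustrative sketch: the BlockOf helper and the sample characters are made up for this example and are not part of PowerToys.

```csharp
using System;
using System.Text.RegularExpressions;

// .NET named-block classes for the four ranges listed in the issue.
var blocks = new (string Name, Regex Pattern)[]
{
    ("IsCJKSymbolsandPunctuation", new Regex(@"\p{IsCJKSymbolsandPunctuation}")),
    ("IsHiragana", new Regex(@"\p{IsHiragana}")),
    ("IsKatakana", new Regex(@"\p{IsKatakana}")),
    ("IsCJKUnifiedIdeographs", new Regex(@"\p{IsCJKUnifiedIdeographs}")),
};

// Hypothetical helper: returns the first listed block that contains the character, or "other".
string BlockOf(string ch)
{
    foreach (var (name, pattern) in blocks)
    {
        if (pattern.IsMatch(ch))
        {
            return name;
        }
    }
    return "other";
}

foreach (string ch in new[] { "、", "は", "テ", "ー", "語", "A" })
{
    Console.WriteLine($"{ch} -> {BlockOf(ch)}");
}
```

Note that "、" and "。" belong to IsCJKSymbolsandPunctuation, so a pattern that covers only kana and Kanji does not match them.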

✔️ Expected Behavior

テキスト抽出子は、OCRパックがインストールされている言語のみを認識できます。

❌ Actual Behavior

テ キ ス ト 抽出子 は 、 OCR パ ッ ク が イ ン ス ト ー ル さ れ て い る 言語 の み を 認識 で き ま す 。

Other Software

No response

Dub1shu commented 1 year ago

As @kayoyoyak said, I think adding katakana, hiragana, CJKSymbolsAndPunctuation will solve the issue of spaces being inserted between Japanese characters.

CJK stands for Chinese, Japanese and Korean. In Chinese and Japanese, the OCR engine inserts spaces between all characters, so without a workaround, Text Extractor outputs a space between every character. For Korean, the OCR engine does not insert spaces between all Hangul characters; it inserts them only where needed, because spaces have a grammatical meaning in Korean.

So for this issue, I think a simple fix of adding katakana and hiragana to the Regex should be enough. Do we need to create a pull request?
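
A minimal sketch of that idea, assuming the OCR result arrives as a list of words: JoinWords and the sample word array are hypothetical stand-ins, not the actual PowerToys code. A space is written between two words unless both of them contain Kanji or kana.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Assumed pattern: the existing Kanji block plus the two kana blocks proposed here.
var cjkOrKana = new Regex(@"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}");

// Hypothetical helper: joins OCR words, omitting the space only between two CJK/kana words.
string JoinWords(string[] words)
{
    var text = new StringBuilder();
    bool prevIsCjk = false;
    for (int i = 0; i < words.Length; i++)
    {
        bool isCjk = cjkOrKana.IsMatch(words[i]);
        if (i > 0 && !(isCjk && prevIsCjk))
        {
            text.Append(' ');
        }
        text.Append(words[i]);
        prevIsCjk = isCjk;
    }
    return text.ToString();
}

string[] words = "テ キ ス ト 抽出子 は OCR パ ッ ク".Split(' ');
Console.WriteLine(JoinWords(words)); // テキスト抽出子は OCR パック
```

With this rule, Japanese text is joined, while a Latin word such as "OCR" keeps a space on both sides.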

kayoyoyak commented 1 year ago

@Dub1shu I see, I agree with your approach.

As a side note, Unicode 3000 to 303F contains symbols used in Japanese which, like the rest of Japanese text, do not need surrounding spaces. However, I am not sure whether they are needed for Chinese and Korean. Therefore, I think it is sufficient to add only hiragana and katakana to the regular expression for this workaround.

Dub1shu commented 1 year ago

I agree with adding only hiragana and katakana. Punctuation isn't that big of an issue, so it might be better to wait for the OCR engine to improve than to handle it in PowerToys.

Let's suggest adding hiragana and katakana. Could you create a PR?

kayoyoyak commented 1 year ago

Sorry, I have no experience with Windows apps or GitHub. So I don't know what to do :(

lamrongol commented 1 year ago

I want to fix this problem, too. However, @Dub1shu, I've read https://github.com/microsoft/PowerToys/tree/main/doc/devdocs and followed the instructions, but a CS0246 error occurs at using interop;.

Edit: 2022/12/03 19:14 I wrote the following replacement code. The core is only two lines, so could you add it?

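using System;
using System.Text.RegularExpressions;

// Sample OCR output with a space between every character
string str = "テ キ ス ト 抽出子 は 、 OCR パ ッ ク が イ ン ス ト ー ル さ れ て い る 言語 の み を 認識 で き ま す 。";

// Matches one hiragana, katakana, half-width katakana, "、" or "。", plus an optional space on each side
Regex regexKanaPunctuation = new Regex("\\s?([\u3040-\u309F\u30A0-\u30FF\uFF61-\uFF9F、。])\\s?");
// Keep only the captured character, dropping the adjacent spaces
string result = regexKanaPunctuation.Replace(str, "$1");

Console.WriteLine(result); // テキスト抽出子は、OCRパックがインストールされている言語のみを認識できます。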

@kayoyoyak https://github.com/microsoft/PowerToys/tree/main/doc/devdocs and this Japanese page https://qiita.com/y-vectorfield/items/b955617712f3b66359f2 may help you.

Dub1shu commented 1 year ago

@kayoyoyak OK. I will work on it.

@lamrongol I pulled the latest version, compiled it, and it built fine in my environment. I will try to create a PR to fix this.

Dub1shu commented 1 year ago

I looked into this issue. The problem of spaces being inserted in Japanese was fixed in #20415. As of #20415, it works fine in Japanese.

But #20926 started causing problems again.

#20926 seems to have been created because Chinese requires a space between Kanji (Chinese characters) and English. The fix in #20926 should only apply to Chinese, not Japanese or Korean. Additionally, that fix does not take Japanese-specific characters into account, which causes problems with Japanese.

So I think the best way to solve this problem is to fix it so that the fix applied in #20926 is limited to Chinese only.

lamrongol commented 1 year ago

@Dub1shu My code only applies to Japanese-specific characters (Hiragana, Katakana, Hankaku-Katakana, "、", "。"), so I think this fix will not cause trouble for Chinese OCR.

ghost commented 1 year ago

@Dub1shu My code only applies to Japanese-specific characters (Hiragana, Katakana, Hankaku-Katakana, "、", "。"), so I think this fix will not cause trouble for Chinese OCR.

I think what he (Dub1shu) means is that the code in https://github.com/microsoft/PowerToys/pull/20926 caused the bug, so we should fix #20926 instead of adding new redundant code over and over again. The bug is most likely caused by #20926, since the problem of spaces being inserted in Japanese was already fixed in https://github.com/microsoft/PowerToys/pull/20415. It looks relatively simple, so I'll try to fix it.

lamrongol commented 1 year ago

@AO2233

I think what he means is that the code in https://github.com/microsoft/PowerToys/pull/20926 caused the bug, so we should fix https://github.com/microsoft/PowerToys/pull/20926 instead of adding new redundant code over and over again.

No, I mean that my code affects neither Kanji (Chinese characters) nor #20926, so we don't need to fix #20926.

lamrongol commented 1 year ago

Oh, I didn't know Chinese also uses "、" and "。".

Dub1shu commented 1 year ago

@AO2233 Wow. Thanks for creating the PR! I was going to work on it tonight, but you got to it first.

What @AO2233 said is exactly what I meant. @lamrongol's method might work correctly, but it seems a little hacky.

One of the following methods seems better.

  1. Japanese and non-CJK words are separated by a space: add hiragana and katakana to the regular expression. This is @AO2233's fix.

  2. Japanese and non-CJK words are not separated by a space: limit #20926 to Chinese. This is the method that I mentioned last night.
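
Method 2 could look roughly like the sketch below. It rests on several assumptions: the OCR language is available as a tag such as "ja" or "zh-Hans", JoinForLanguage is a hypothetical helper, and the Chinese branch is a simplified stand-in for the rule described for #20926, not its actual code.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

var ideograph = new Regex(@"\p{IsCJKUnifiedIdeographs}");

// languageTag is a hypothetical stand-in for the OCR language in use, e.g. "ja" or "zh-Hans".
string JoinForLanguage(string[] words, string languageTag)
{
    if (languageTag.StartsWith("ja", StringComparison.OrdinalIgnoreCase))
    {
        // Japanese: drop every space the OCR engine inserted.
        return string.Concat(words);
    }

    if (!languageTag.StartsWith("zh", StringComparison.OrdinalIgnoreCase))
    {
        // Other languages (including Korean): keep a single space between words.
        return string.Join(" ", words);
    }

    // Chinese: no space between two ideograph words, a space at every other boundary.
    var text = new StringBuilder();
    bool prevIsIdeograph = false;
    for (int i = 0; i < words.Length; i++)
    {
        bool isIdeograph = ideograph.IsMatch(words[i]);
        if (i > 0 && !(isIdeograph && prevIsIdeograph))
        {
            text.Append(' ');
        }
        text.Append(words[i]);
        prevIsIdeograph = isIdeograph;
    }
    return text.ToString();
}

Console.WriteLine(JoinForLanguage(new[] { "OCR", "パ", "ッ", "ク", "が" }, "ja"));      // OCRパックが
Console.WriteLine(JoinForLanguage(new[] { "文", "本", "OCR", "引", "擎" }, "zh-Hans")); // 文本 OCR 引擎
```

One trade-off of this simplified Japanese branch is that consecutive Latin words would also be glued together, which method 1 avoids.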

ghost commented 1 year ago

Thanks for your comments! If you have time, please check my code and verify the results. @Dub1shu If you have a cleaner, uniform implementation, please commit it! @lamrongol Your method is absolutely workable. I just think hack-style code will make this problem a little more complex. Actually, I don't think either my code or the code in #20926 is beautiful, so if you have a cleaner, uniform implementation, please commit it!

As for OCR of Chinese or mixed-language text, I previously used PaddleOCR as an offline backend (though it was not convenient). It is based on a well-trained machine learning model and has higher accuracy for Chinese. The PowerToys OCR uses the Windows API, which does not work very well when the text includes many languages. Do you have any recommendations for a Japanese text detection model or engine?

My English and Japanese are not good, so please forgive any mistakes. 本当にありがとうございました (thank you very much).

TheJoeFin commented 1 year ago

@AO2233 feel free to check out Text Grab and see how the space logic is done.

https://github.com/TheJoeFin/Text-Grab/blob/main/Text-Grab/Utilities/LanguageUtilities.cs

ghost commented 1 year ago

@TheJoeFin Wow! The code looks neat and tidy now. I've seen the core code (https://github.com/TheJoeFin/Text-Grab/blob/main/Text-Grab/Utilities/OcrExtensions.cs#L40-L71) from this commit (https://github.com/TheJoeFin/Text-Grab/commit/c28de507f25c4952bc408b3657e7cd3987e06eb4), and I checked this issue (https://github.com/TheJoeFin/Text-Grab/issues/191) before. 👀

The difference between the PowerToys version of Text-Grab and TheJoeFin's original version is the Regex. TheJoeFin's is more detailed and more radical:

// Text-Grab (TheJoeFin original)
// (when OCR language is zh or ja)
// matches words in a space-joining language, which contains:
// - one letter that is not in "other letters" (CJK characters are "other letters")
// - one number digit
// - any words longer than one character
// Chinese and Japanese characters are single-character words
// when a word is one punctuation/symbol, join it without spaces
Regex regexSpaceJoiningWord = new(@"(^[\p{L}-[\p{Lo}]]|\p{Nd}$)|.{2,}");
// or
Regex regexSpaceJoiningWord = new(@"\p{IsCJKUnifiedIdeographs}|\p{IsGeneralPunctuation}|\p{IsCJKSymbolsandPunctuation}|\p{IsHiragana}|\p{IsKatakana}|\p{IsHalfwidthandFullwidthForms}");

// Text-Grab (PowerToys now)
// just add the most important Hiragana, Katakana, Hankaku-Katakana.
var cjkRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}|\p{IsHiragana}|\p{IsKatakana}|[\uFF61-\uFF9F]");

It's hard to say which strategy is better; maybe the original one. Both still make some small, ignorable mistakes.

I think that as long as we use the Windows OCR API (1. it provides no information about spaces; 2. it makes more recognition errors than state-of-the-art OCR engines) and have to add spaces by fixed rules (such as language grammar), we cannot do the OCR job perfectly. If we want the OCR text to match the picture exactly, do we need a better engine (https://github.com/microsoft/PowerToys/issues/20899)?

Can you merge the latest Text Grab code into PowerToys🍟? It's much better than this old version.❤️

crutkas commented 1 year ago

Fixed in 0.71 release of PowerToys. aka.ms/installpowertoys