obynio / anki-japanese-furigana

Anki add-on providing support for adding furigana on Japanese text
https://ankiweb.net/shared/info/678316993
GNU General Public License v3.0

Retain ASCII space characters in original input string #17

Closed ahlec closed 1 year ago

ahlec commented 1 year ago

The response from MeCab is space-delimited: individual kanji/reading nodes are separated by an ASCII space (e.g., リンゴ[リンゴ] を[ヲ] 食べる[タベル]). When we process the result, we work on one MeCab node at a time to determine whether that node needs a reading (furigana) or not. To do this, we split the string on ASCII spaces. As a result, any ASCII spaces in the input string are stripped from the final result, because MeCab doesn't distinguish spaces that were in the original string from the ones it adds to delimit nodes.
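The loss can be reproduced without MeCab. The sketch below is only an illustration: `process_nodes` is a hypothetical stand-in for the add-on's per-node processing, and the sample string is an assumed output in the format quoted above, not a captured MeCab response.

```python
def process_nodes(mecab_output: str) -> str:
    """Hypothetical stand-in for per-node processing: split on ASCII
    spaces, visit each node, then glue the nodes back together."""
    nodes = mecab_output.split(" ")
    # Empty strings come from consecutive spaces and carry no node.
    return "".join(node for node in nodes if node)


# Assumed output for an input such as "Apple リンゴ": the space the user
# typed and a separator added between nodes look identical here.
sample = "Apple リンゴ[リンゴ]"
result = process_nodes(sample)
```

After the split and rejoin, `result` contains no space at all, so the one the user typed is gone.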

To fix this, I modified the code to replace every ASCII space (U+0020) with a Unicode character that we should never encounter in any card in the wild. At the end, we reverse the replacement. This is the same approach already used for newlines.
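A minimal sketch of the replace-and-restore idea, under stated assumptions: the placeholder code point (`U+E000`, from the Private Use Area) and both function names are hypothetical, since the thread does not say which character the add-on uses. `drop_separators` merely imitates the space-stripping effect of the MeCab round trip; how MeCab itself tokenizes the placeholder is not modeled.

```python
from typing import Callable

# Assumption: a private-use code point stands in for the "never seen in the
# wild" character. The add-on's actual choice is not stated in this thread.
SPACE_PLACEHOLDER = "\uE000"


def with_spaces_preserved(text: str, process: Callable[[str], str]) -> str:
    """Protect ASCII spaces, run `process`, then restore the spaces."""
    protected = text.replace(" ", SPACE_PLACEHOLDER)
    processed = process(protected)
    return processed.replace(SPACE_PLACEHOLDER, " ")


def drop_separators(text: str) -> str:
    """Hypothetical stand-in for the MeCab round trip: removes every
    ASCII space, as splitting on spaces and rejoining would."""
    return "".join(text.split(" "))


restored = with_spaces_preserved("I like リンゴ", drop_separators)
```

Here `restored` equals the original input, whereas calling `drop_separators` on the input directly would collapse it to `Ilikeリンゴ`. The approach relies on the placeholder never appearing in real card text; if it did, it would be turned into a space on the way out.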

I've tested this on both Anki 2.1.54 (M1 Silicon) and Anki 2.1.49 (Mac Intel) and have run into no issues.

obynio commented 1 year ago

Seems like a very good idea indeed. I doubt anyone noticed the issue, since Japanese usually does not use spaces, but it's still a good addition! I will merge and release this change in 1.3.1.