Replace Kakasi and fix issue around katakana-kanji compounds

ahlec commented 2 years ago

This closes #18. Background information and general reason for change can be found there.

The change to support ローマ字 and ローマ帝国 (I'm not sure why but Mecab breaks down specifically with ローマ) are simply the lines that change reading[index] == kanji[index] into areKanaEqual(reading[index], kanji[index]). The bug itself is fixed by making sure that we're comparing both sides of the equation in hiragana, rather than having 'ろ' == 'ロ'. I needed to switch from us using reading to create the ReadingNode into using kanji because we've already converted reading into hiragana, and this meant we were getting an output of ろーま字 otherwise.

However, the performance of doing this using Kakasi (our existing means of converting katakana → hiragana) was really bad. This is because we're running a subprocess and doing I/O every time we call the Kakasi function. However, Kakasi wasn't actually necessary. We were only using it on pathways where we specifically wanted to convert hiragana/katakana/mixed hiragana-katakana → only hiragana. While the program can handle significantly higher complexity, our existing usages of it just used it on reading, which is pure kana. Interestingly, I've checked the original plugin this forked from and they were also using it just for reading.

I spent some time to write a number of unit tests for Kakasi, and ensure all of them passed. I then removed Kakasi and replaced it with an implementation that uses string.translate and transforms characters in the Katakana Unicode block into ones in the Hiragana Unicode block (with 2 exceptions) and leaves the rest intact. I found that this satisfied all of the unit tests for Kakasi, and also improved our performance. It also meant we have fewer applications that we ship with this plugin.

As part of this, I also updated our comments on the flow through MecabController.reading, which erroneously indicated we were going down pathways we weren't actually going down.

I've tested this on Anki 2.1.54 (M1) and 2.1.49 (OSX Intel). The current version (including the type annotations) work in both of these versions.

I absolutely understand that this is a major change and am open to discussion and taking time to ensure this is the change we want and doesn't introduce regressions.

obynio commented 2 years ago

Okay so I thought about this change for a few days. I'm a bit conflicted about it because the original plugin ClozeFuriganaTool which inspired most like-minded plugins on AnkiWeb have all been using Kakasi for the last 10 years...

But on the other hand this kind of unicode arithmetic is not very complex, and ultimately I can't find a valid reason why we should not try it. It's a good idea and it's worth the shot. I'll merge this change and release it.

obynio commented 2 years ago

@ahlec I also sent you an invite to be a collaborator on the repo if you'd like to be one. You have clearly a good understanding of the plugin and you've been a precious help in the last few weeks 👍

obynio commented 2 years ago

@ahlec actually I think the problem with ローマ字 might come because of the old version of UniDic that we're using as a dictionary for MeCab. More recent versions are available on this site but the whole dictionary size is ~481MB !

That's why we are stuck with Unidic 2.1.2 from 2013, which is the most recent release of UniDic that's small enough to be distributed.

ahlec commented 2 years ago

Oooh. Updating the dictionary could also improve the accuracy of the readings. Makes a ton of sense that we can't put the plugin on ankiweb.net weighing ~500MB haha. I wonder if we couldn't add an option for users to download the dictionary after installing the plugin, like mobile apps do? Could still distribute the older UniDic as a fallback, but let users update it within Anki if they want to.

Counter point, it now means we need to handle bugs from two versions of the dictionaries.

obynio commented 2 years ago

I was thinking the same thing. Actually there is just the archive to download and to replace these files in the support folder. The file dicrc is here to tell mecab how to "encode" the data so that it can be read by the plugin afterwards.

char.bin, matrix.bin, sys.dic and unk.dic

obynio / anki-japanese-furigana

Replace Kakasi and fix issue around katakana-kanji compounds #20