kaegi / MorphMan

Anki plugin that reorders language cards based on the words you know

Bugs with new AnkiSpacy implementation #229

Open nlovell1 opened 3 years ago

nlovell1 commented 3 years ago

Using the latest versions of MorphMan / AnkiSpacy from GitHub, Anki 2.1.35, and the latest Japanese large model. Recalc works fine. I was able to generate a study plan, and recalc afterwards works fine as well.

  1. There may be a problem interpreting lines in subtitle files. I added the subs for two sets of shows to the study plan. Both used to give me line statistics, but now only one of them is calculated (the other shows 0). Not sure how to reproduce this. Will update.

  2. Known morphs and known variations are the same number. Not sure whether this is a bad thing. The new parser seems to be doing its job pretty well from what I've seen, but I have to do more testing to see how it counts conjugations of the same verb (whether it counts the helpers individually, or stores them under the dictionary form of the verb they were found in). Will update.

  3. The existing feature that adjusts new cards to align with a frequency list and study plan no longer works with the new parser. As I said, I could generate a study plan, but no cards had the tag 'frequency list' in the browser.

  4. Parsing, especially when trying to parse subtitle files, feels a LOT slower.

Thanks for all the hard work.

cordone commented 3 years ago

On 4, make sure you're using nlp.pipe(list of texts) instead of repeated nlp(text) calls. I think pipe also offers multiprocessing through the n_process argument, but I don't know whether increasing it still works with the Japanese model. I ran into issues with tokens having empty features the last time I tried it. With batching through pipe, the processing time even for several thousand sentences should still be somewhat sane.
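A minimal sketch of the batching advice above, not MorphMan's actual code. The helper name lemmatize_lines, the batch size, and the model name in the comments are assumptions for illustration. The nlp argument is expected to be a loaded spaCy pipeline.

```python
from typing import Iterable, List, Tuple


def lemmatize_lines(nlp, lines: Iterable[str],
                    batch_size: int = 256) -> List[List[Tuple[str, str]]]:
    """Parse many lines with one batched nlp.pipe() call.

    `nlp` is assumed to be a loaded spaCy Language object, for example
    spacy.load("ja_core_news_lg") (the model name is an assumption).
    Returns one list of (surface form, lemma) pairs per non-blank line.
    """
    # Drop blank lines up front so no empty docs go through the pipeline.
    texts = [line.strip() for line in lines if line.strip()]

    results = []
    # nlp.pipe() streams the texts through the pipeline in batches, which
    # avoids the per-call overhead of running nlp(text) once per line.
    # n_process is left at its default here, since raising it reportedly
    # caused problems with the Japanese model.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(token.text, token.lemma_) for token in doc])
    return results


# Hypothetical usage (requires spaCy and a Japanese model to be installed):
#   import spacy
#   nlp = spacy.load("ja_core_news_lg")
#   morphs = lemmatize_lines(nlp, open("subs.srt", encoding="utf-8"))
```

Passing the pipeline in as a parameter keeps the model loaded once and reused for every subtitle file, which matters because loading the large model is itself slow.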

Other than that, SudachiPy currently seems to be ~30x slower than MeCab according to my own tests using this benchmark. When the big push to add Japanese support to spaCy was going on last year, some work was done to increase SudachiPy's performance with Cython, but AFAIK that work is unfinished and not seen as a priority. AFAIK they're the same kind of tokenizer, so I assume SudachiPy can get much closer to MeCab's speed (it seems at one point there was a test version only ~10x slower); it's just a WIP.
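For anyone who wants to check the speed gap on their own subtitle lines, here is a rough timing sketch. It is not the benchmark referred to above. The function name time_tokenizers and the repeat count are assumptions, and the tokenizers are passed in as plain callables so the helper does not depend on any particular library.

```python
import time
from typing import Callable, Dict, Iterable, List


def time_tokenizers(tokenizers: Dict[str, Callable[[str], object]],
                    sentences: Iterable[str],
                    repeats: int = 3) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each tokenizer.

    Each tokenizer is called once per sentence; the fastest of `repeats`
    full passes is kept to reduce noise from other processes.
    """
    corpus: List[str] = list(sentences)
    timings: Dict[str, float] = {}
    for name, tokenize in tokenizers.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for sentence in corpus:
                tokenize(sentence)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Hypothetical usage (assumes mecab-python3 and SudachiPy are installed):
#   import MeCab
#   from sudachipy import dictionary
#   tagger = MeCab.Tagger()
#   sudachi = dictionary.Dictionary().create()
#   print(time_tokenizers({"mecab": tagger.parse,
#                          "sudachi": sudachi.tokenize}, lines))
```

The ratio between the two timings is the number to compare against the ~30x figure; absolute times will vary with hardware and dictionary size.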

nlovell1 commented 3 years ago

Thank you for your reply and the information.

Do you know of any way to use JMdict entries to train a higher-quality model? Because the current spaCy models are trained on news text, the parser is having problems with collocations and colloquialisms as I go through my cards. Lots of expressions and verbs made of compound parts are being parsed too finely, which causes a lot of problems in my cards. The project ichi.moe does something similar to what I'm thinking of, with a few tweaks. I would like to know your thoughts on this.

Edit: I remembered you told me about training my own model (https://github.com/kaegi/MorphMan/issues/225#issuecomment-754363709). I'm really interested. How do I go about that?