nlovell1 opened 3 years ago
On 4, make sure you're using nlp.pipe(list of texts) instead of repeated nlp(text) calls. I think pipe also offers multiprocessing via the n_process argument, but I don't know if increasing it still works with the Japanese model; I ran into issues with tokens having empty features the last time I tried it. With that, the processing time even for several thousand sentences should still be somewhat sane.
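To make the batching advice concrete, here is a minimal sketch. It uses a blank English pipeline so it runs without downloading anything; with the Japanese model you would call spacy.load("ja_core_news_lg") (or whichever model you actually have installed) instead, and the n_process remark in the comments is the untested part mentioned above.

```python
import spacy

# Blank English pipeline stands in for the real model here so the
# sketch is self-contained; swap in spacy.load("ja_core_news_lg")
# for actual Japanese processing.
nlp = spacy.blank("en")
texts = ["first sentence", "second sentence", "third sentence"]

# Slow pattern: one nlp() call per text.
docs_slow = [nlp(t) for t in texts]

# Faster: nlp.pipe() streams the texts through the pipeline in
# batches. Passing n_process=2 (or more) would fork worker
# processes, but as noted above that may misbehave with the
# Japanese tokenizer, so test it before relying on it.
docs_fast = list(nlp.pipe(texts, batch_size=64))

print(len(docs_fast))
```

For large inputs you would normally keep nlp.pipe as a generator and iterate over it rather than materializing the whole list at once.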
Other than that, SudachiPy currently seems ~30x slower than MeCab according to my own tests with this benchmark. When the big push to add Japanese support to spaCy was going on last year, some work was done to speed up SudachiPy with Cython, but that work is AFAIK unfinished and not seen as a priority. AFAIK they're the same kind of tokenizer, so I assume SudachiPy can get much closer to MeCab's speed (at one point there seems to have been a test version only ~10x slower); it's just a WIP.
Thank you for your reply and the information.
Do you know of any way to use JMdict entries to train a higher-quality model? Especially because the current spaCy models are trained on news text, it's having some problems with collocations and colloquialisms right now as I go through cards. Lots of expressions and verbs made of compound parts are being parsed too finely and causing a lot of problems in my cards. The project ichi.moe does something pretty similar to what I'm thinking of, with a few tweaks. I'd like to know your thoughts on this.
Edit: I remembered you told me about training my own model (https://github.com/kaegi/MorphMan/issues/225#issuecomment-754363709). Really interested; how do I go about that?
Using latest versions of MorphMan / AnkiSpacy from github. Anki 2.1.35. Using latest Japanese large model. Recalc works fine. Study plan was able to be generated, recalc after works fine as well.
Might have some trouble with interpreting lines in subtitle files. I had two sets of shows whose subs I put into the study plan. Both used to give me line statistics, but now only one of them was calculated (the other showed 0). Not sure how to replicate this. Will update.
Known morphs and known variations are the same number. Not sure if this is a bad thing. The new parser seems to be doing its job pretty well from what I've seen; I have to do more testing to see how it counts conjugations of the same verb (whether it counts the helpers individually, or stores them under the dictionary form of the verb they were found in). Will update.
The existing feature of adjusting new cards to align with a frequency list and study plan no longer works with the new parser. Like I said, I could generate a study plan, but no cards had the 'frequency list' tag in the browser.
Parsing, especially when trying to parse subtitle files, feels a LOT slower.
Thanks for all the hard work.