Replacing MeCab with alternative parsing dictionary / adding implementation for user input when reviewing cards

nlovell1 commented 3 years ago

I'm new to GitHub and not sure if this is the best place to post, so please let me know if it isn't.

I'm interested in helping replace MeCab with another parser, mainly out of frustration with two things: 1. homophonic grammar structures marked as 'known' often have more than one semantically distinct use, and 2. collocations, colloquialisms, and figures of speech are disregarded and broken up into pieces. In my experience, both have pushed cards to i+2 or greater. Solving this seems to be beyond the scope of general tokenizers / morphological analyzers. Morphemizers like MeCab or even Sudachi tokenize a sentence into "morphemes", but the results I expect are closer to 文節 (bunsetsu, i.e. phrase-level units)... the only software I can find that does that is J.DepP https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/, and I'm unsure how it would be integrated.

I would also like to revamp the system for comprehension cards, as the morph marked as the target of a sentence is often not the morph that is actually unknown to the user. Ideally, I would like to see an implementation that asks for user input to redefine the target (that is, the unfamiliar/unknown morph in a sentence) when the parsing dictionary gets it wrong. It is unclear at this time whether improving the parser would remove the need for this, but as of right now I think it could be a useful band-aid.
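To make the override idea concrete, here is a minimal sketch. It assumes the sentence has already been split into morphs and that a set of known morphs is available. The names `unknown_morphs`, `pick_target`, and `user_choice` are hypothetical and are not part of MorphMan.

```python
from typing import List, Optional, Sequence, Set


def unknown_morphs(morphs: Sequence[str], known: Set[str]) -> List[str]:
    """Return the morphs that are not known, in sentence order, without duplicates."""
    seen = set()
    result = []
    for m in morphs:
        if m not in known and m not in seen:
            seen.add(m)
            result.append(m)
    return result


def pick_target(
    morphs: Sequence[str],
    known: Set[str],
    user_choice: Optional[str] = None,
) -> Optional[str]:
    """Pick the target morph for a card.

    A user-supplied morph wins as long as it occurs in the sentence.
    Otherwise fall back to the first unknown morph (None if all are known).
    """
    if user_choice is not None and user_choice in morphs:
        return user_choice
    unknowns = unknown_morphs(morphs, known)
    return unknowns[0] if unknowns else None


# Hand-written segmentation, used only for illustration.
morphs = ["猫", "の", "手", "も", "借りたい"]
known = {"猫", "の", "手", "も"}

auto_target = pick_target(morphs, known)                    # parser-driven guess
overridden = pick_target(morphs, known, user_choice="手")   # user correction
```

In the same sketch, `len(unknown_morphs(morphs, known))` would give the N in i+N for the card.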

I would love to help out with the development of this, but I'm a little unsure where to start. Please let me know if I can do anything. Thank you.

cordone commented 3 years ago

Have you seen #162 and #221? It seems like it'll be good to go soon.

With spaCy you'll have an entire NLP pipeline that can, among other things, extract grammatical relationships between words (dependency parsing). The provided token features should be enough to distinguish between homophones. See here. More are being added in an upcoming release.
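As a rough illustration of how token features can separate homophones, the sketch below keys each morph by lemma plus part of speech instead of by surface form alone. The `Token` tuples are hand-written stand-ins for parser output, the POS labels are made up for the example (they are not real spaCy tags), and `morph_key` is a hypothetical helper.

```python
from typing import NamedTuple, Tuple


class Token(NamedTuple):
    surface: str
    lemma: str
    pos: str  # illustrative label, not an actual tag set


def morph_key(tok: Token) -> Tuple[str, str]:
    """Identify a morph by (lemma, pos) so identical surfaces can differ."""
    return (tok.lemma, tok.pos)


# が as a subject marker vs. が as a conjunction meaning "but".
ga_subject = Token("が", "が", "case-particle")
ga_but = Token("が", "が", "conjunctive-particle")

# The learner has only studied the subject-marking use.
known = {morph_key(ga_subject)}

is_known_subject = morph_key(ga_subject) in known
is_known_but = morph_key(ga_but) in known
```

With surface-only keys both uses would count as known. With the richer key, only the one that was actually studied does.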

It should also be pretty straightforward to use those features to find common n-grams.
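A minimal sketch of n-gram counting over already-tokenized sentences, using only the standard library. The token lists are hand-written for illustration. In practice they would come from whichever parser is in use, and `count_ngrams` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List


def count_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Count every contiguous run of n tokens across all sentences."""
    counts: Counter = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


sentences = [
    ["気", "を", "つける"],
    ["気", "を", "つける", "べき"],
    ["気", "に", "なる"],
]

bigrams = count_ngrams(sentences, 2)
trigrams = count_ngrams(sentences, 3)
```

N-grams that recur often across a corpus (here 気 / を / つける) are candidates for being treated as a single unit rather than as separate morphs.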

Matching specific collocations and phrases is going to be more complicated. spaCy provides Matcher and PhraseMatcher components that may be of use here. As far as I can tell they are rule-based and run on top of the tokenizer, so they shouldn't require training your own model, but I'm not that familiar with them.

An alternative option, more on the MeCab and SudachiPy level, would be to create a user dictionary containing phrases. This is what the creators of jisho.org did. They made a user dictionary for MeCab with phrases from JMdict. spaCy uses SudachiPy for its Japanese models, but I don't know if you can just add a user dictionary and expect it to work without training a new model.
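To show the effect a phrase dictionary is meant to have, here is a minimal sketch that merges known phrases after tokenization using greedy longest-match. This is a post-processing stand-in, not how a MeCab or SudachiPy user dictionary works internally, and `merge_phrases` and the phrase set are hypothetical.

```python
from typing import List, Set, Tuple


def merge_phrases(tokens: List[str], phrases: Set[Tuple[str, ...]]) -> List[str]:
    """Join runs of tokens that match a known phrase, preferring longer phrases."""
    max_len = max((len(p) for p in phrases), default=1)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer phrases win.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                out.append("".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token.
            out.append(tokens[i])
            i += 1
    return out


# A tiny hand-written phrase list. A real one could be derived from JMdict.
phrases = {("気", "を", "つける"), ("猫", "の", "手", "も", "借りたい")}

merged = merge_phrases(["車", "に", "気", "を", "つける"], phrases)
```

The merged phrase can then be tracked as one morph, so knowing 気 and つける separately no longer marks the expression as known.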

nlovell1 commented 3 years ago

spaCy looks really cool. I never knew about it before. I'll be sure to check the change logs and see what I can do.

On another note, the creator of ichi.moe (which I've found to be INCREDIBLY accurate for identifying phrases and grammatical structures) just announced on their blog a command-line utility for the algorithm behind ichi.moe. If you haven't used ichi.moe before, I would suggest trying it. It works exceptionally well at parsing sentences. As far as I can tell, it brute-forces JMdict and has a lot of hard-coded exceptions. Regardless, I am curious to see how spaCy performs on expressions, so I'm about to test it out.

As far as I know, SudachiPy won't work with any user dictionary.

I'll keep you posted.

nlovell1 commented 3 years ago

@cordone after experimenting with spaCy, it seems there is still more work to be done to recognize multi-word expressions and phrases/collocations using the PhraseMatcher and other components. I'm not sure how much time I should spend on it, seeing as the MorphMan maintainers have not merged the update created by @rteabeault here. It seems as if MorphMan is largely abandoned, and judging by the AnkiWeb reviews/feedback there's no QC for buggy implementations. I'm happy to expand on what I've found out so far, though, if you're interested in the details. I remember we were discussing NLP-related topics in another post, so you might find it of interest.

kaegi / MorphMan

Replacing MeCab with alternative parsing dictionary / adding implementation for user input when reviewing cards #225