mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0
52 stars, 7 forks

Installing spaCy #110

Closed HQYang1979 closed 8 months ago

HQYang1979 commented 8 months ago

[screenshot]

The error message says the file C:.....Activate.ps1 cannot be loaded because running scripts is disabled on this system...

mortii commented 8 months ago

@HQYang1979 thank you, good to know. How can it be fixed?

HQYang1979 commented 8 months ago

Done, I have to change the group policy to enable all the scripts.
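For anyone else who hits this: changing group policy is not strictly necessary. Assuming a standard PowerShell setup, relaxing the execution policy for the current user only is usually enough (a sketch of the standard commands, not ankimorphs-specific instructions):

```powershell
# Allow locally created scripts (such as a venv's Activate.ps1) to run.
# Scoped to the current user, so no admin rights or group-policy edit is needed.
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Alternatively, bypass the policy for a single session only:
powershell -ExecutionPolicy Bypass
```

`Set-ExecutionPolicy` and the `RemoteSigned` policy are standard PowerShell; see `Get-Help about_Execution_Policies` for details.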

Will you make it run directly instead of through the terminal?

Let me mark the new things: [screenshots]

The U and A changed significantly: [screenshot]

mortii commented 8 months ago

> Done, I have to change the group policy to enable all the scripts.

Nice. That's problematic for people who don't have the admin rights to change group policies, but at that point maybe Anki and spaCy shouldn't be installed anyway.

> Let me mark the new things: [screenshot]

To make spaCy work I had to make some minor changes to the morph-caching procedure and the UI.

> The U and A changed significantly: [screenshot]

Yep, as expected. More accurate numbers, hopefully.

What is your initial impression of spaCy? Is it much slower? Is it more accurate?

HQYang1979 commented 8 months ago

But I am not sure it really works unless the unknowns show their base forms.

[screenshots]

HQYang1979 commented 8 months ago

> What is your initial impression of spaCy? Is it much slower? Is it more accurate?

@mortii Give me some time to immerse myself. It is slower because it really starts to analyze the words. I am not sure about accuracy since I cannot see the base forms.

HQYang1979 commented 8 months ago

Correct me if I am wrong, but I am not sure it is working.

For the word "filled", when I press L, the morphs should be "fill, filled, etc.", but only "filled" is shown.

[screenshots]

But somehow it is better: [screenshot]

ashprice commented 8 months ago

Not sure if this belongs here or if I should make a new issue, but in terms of documentation, the docs currently direct the user to pick models via spaCy's little interactive box on their language models page.

I would recommend putting a warning there that for some languages this will suggest transformer models (ending in _trf) if you request a higher-accuracy model, which will likely require additional dependencies (namely the spacy-transformers package). I couldn't get these to work with ankimorphs even after installing those dependencies!

The large models (ending in _lg) are what I would default to personally, where available. For some languages this is what the UI recommends as a 'more accurate' model; for others it is not (it will recommend _trf instead). Almost all, if not all, languages with a _trf model also have an _lg model, but you have to actually navigate to the language's packages page to see that.
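To make the model-name suffixes concrete: spaCy pipelines are installed with the `spacy download` CLI. The English package names below are real published spaCy packages; substitute the names for your language (a sketch of the usual setup commands, not ankimorphs-specific instructions):

```shell
# Large (lg) pipeline: a sensible default where available.
python -m spacy download en_core_web_lg

# Transformer (trf) pipeline: additionally requires spacy-transformers.
pip install spacy-transformers
python -m spacy download en_core_web_trf
```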

Edit: a note - my guess is that those dependencies didn't install properly, given that the install pulls GPU CUDA packages and I have an AMD card. This kind of thing is often a headache on my distro. I am not using a virtualenv for my spaCy + Anki, so the blame is on me there.

mortii commented 8 months ago

> @mortii Give me some time to immerse myself. It is slower because it really starts to analyze the words. I am not sure about accuracy since I cannot see the base forms.

If you go to the browser and right-click the cards -> View Morphemes, then you should see the base form and the inflected form.

> The word "filled", when I press L, the morphs should be "fill, filled, etc", but there are only "filled".

This is intentional, ref my previous answers: https://github.com/mortii/anki-morphs/issues/76#issuecomment-1836059215, https://github.com/mortii/anki-morphs/issues/76#issuecomment-1836458680
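To spell out the base-form/inflection distinction for anyone skimming: each morph carries both a lemma and an inflected form, so an inflected word never expands into extra morphs; the lemma is just a second view of the same morph. A toy sketch in plain Python (the mini-lexicon is made up; this is not ankimorphs' actual code):

```python
# Toy illustration: each morph is a (lemma, inflection) pair, so the
# inflected surface form never "expands" into additional morphs.
toy_lemmas = {"filled": "fill", "cats": "cat", "ran": "run"}  # hypothetical mini-lexicon

def to_morph(surface: str) -> tuple[str, str]:
    """Return the (lemma, inflection) pair for a surface form."""
    return (toy_lemmas.get(surface, surface), surface)

print(to_morph("filled"))  # ('fill', 'filled'): one morph, two views of it
```

With spaCy, the lemma would come from `token.lemma_` rather than a hand-written lookup table.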

HQYang1979 commented 8 months ago

> > @mortii Give me some time to immerse myself. It is slower because it really starts to analyze the words. I am not sure about accuracy since I cannot see the base forms.
>
> If you go to the browser and right-click the cards -> View Morphemes, then you should see the base form and the inflected form.
>
> > The word "filled", when I press L, the morphs should be "fill, filled, etc", but there are only "filled".
>
> This is intentional, ref my previous answers: #76 (comment), #76 (comment)

Thank you. I see the morphemes now!

mortii commented 8 months ago

@HQYang1979 is it okay if I use your picture in the guide?

[screenshot]

HQYang1979 commented 8 months ago

sure, of course

ashprice commented 8 months ago

Maybe we should start a discussion thread for thoughts on spaCy?

I have to say that overall my initial experience with Ankimorphs + spacy is quite positive. In general I'm finding the difficulty of the cards selected to be appropriate in terms of words chosen. This is obviously better in bigger decks where the difficulty jumps can be more fine-grained. I'd like to say a big thank you to @mortii for putting in the legwork and for everyone else who has contributed along the way! Really, I think ankimorphs is pretty awesome in its current state.

Some thoughts and rambles follow...


Languages with spaces

First - the ankimorphs parser for languages with spaces is quite the step down vs. the spacy models. Of course, it is better than no frequency-based ordering, but yeah the experience with a deck that has a spacy model vs. one that doesn't is quite jarring. I might look into how feasible it is to train some models with free tagged corpora - but I know that even finding good, sizeable (and free) corpora may be hard for languages that do not already have models trained, and that's before whatever work is involved in making the PoS tagging etc. available to spacy. It's possible that some public organisations might be willing to give rights for usage on principle given ankimorphs could make learning smaller languages a lot more accessible.

Japanese

For Japanese, I am less decided. I cannot decide whether I prefer the ankimorphs parser or the spacy model... one would think it would be the former given how the latter is segmenting things, but trying both out, it certainly doesn't feel that obvious.

I should note that I personally do not ignore comprehension cards - if they are in my stack of cards, I view them, either for first review, or for deletion. This is just my preferred way of working - I don't want a load of cards that aren't used for anything in my collection, and if it's not a mass-mined but hand-made/mined card, even if ankimorphs thinks it is a 'comprehension card,' the card is there for a reason (namely I made a card to remember it!)

Grammar

The 'open problem' I would say is something I alluded to in #13 - difficulty-by-word is one thing, but what is not so easy is difficulty-by-grammar. And maybe we simply decide this is out of the scope of the project, as it seems a hard problem to solve.

Here, ankimorphs is less consistent, because we are ordering cards based on lemmas, not phrase-level constructions and not the internal complexity of the word the lemma occurred in. To some degree this is one of the things I like about the spaCy model.

For example, @mortii mentioned the case of 本当に being parsed as 本当 + に, and you'll find the same with e.g. 別に, ために, etc. Now, sure, 本当に has a dictionary entry, and it makes total sense for that to be the case, and if you introduce that phrase to a learner, you'll likely introduce it the first time as 本当に = really, truly, etc., not as 本当 = real, true + に = adverbial copula. But really what we have here is 本当 + に, where に is functioning as an adverbializer, and this structure is shared by all of these words/constructions. Sure, there are some with unpredictable meanings, but by the _n_ᵗʰ phrase that you can decompose as Noun + に you know the pattern: you don't need to separately learn all the words ending in -ly in English, you just learn that -ly makes adverbs and then deal with the few cases with weird meanings as and when.

I don't think that particular case is that important... So moving on to something more complicated: compound verbs. Here I feel that both spacy and ankimorphs are good in different ways. Many compound verbs are easily decomposable, so you don't need to learn the whole compound as a lexical unit, you just need to know the components, for example, in 走り回る, you don't really need to learn 走り回る, you can just learn 走る + 回る. The spacy model does this quite well. For less clear compounds, though, ankimorphs does better because it matches for the whole construction.

One approach to this as was discussed a bit before is to match your string or a concatenation of some strings against a list of words, for example, taking ほんとう + に and matching for dictionary entries that are, well, 本当 + に, ie. 本当に. I think ichi.moe probably has this approach down the best that I've seen, my guess is that it matches for the longest string but with some hardcoded exceptions or rules based on context. JPDB's parser also deserves a mention, but I am imagining they use some transformer-based spacy model in the background, or something like it that is completely homegrown (the latter wouldn't surprise me given what I know of the developer, but I haven't asked them about this.)
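The longest-match idea can be sketched in a few lines of Python (the toy lexicon below is invented for illustration; real tools like ichi.moe presumably layer context rules and exceptions on top of something like this):

```python
# Greedy longest-match: merge consecutive tokens into the longest
# concatenation that is still a dictionary entry.
DICTIONARY = {"本当", "に", "本当に", "走る", "回る"}  # toy lexicon

def longest_match(tokens: list[str]) -> list[str]:
    """Merge consecutive tokens into the longest dictionary entry available."""
    result, i = [], 0
    while i < len(tokens):
        best_end = i + 1  # a single token is always kept, even if unknown
        for j in range(len(tokens), i, -1):
            if "".join(tokens[i:j]) in DICTIONARY:
                best_end = j
                break
        result.append("".join(tokens[i:best_end]))
        i = best_end
    return result

print(longest_match(["本当", "に"]))  # ['本当に'] rather than ['本当', 'に']
```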

There are things that are more complicated than many of those cases though: phrase-level constructions, even auxiliary verb conjugation. To some degree matching for lemmas works for both (I think spacy might have an edge for conjugation, probably ankimorphs is better at phrase-level stuff).

But there's another approach - use dependency analysis. This was why I originally pointed to ginza - it's some work and the documentation is lacking, but you can get really good dependency info out of it for Japanese. But I'd kind of like an easier solution than ginza, and spacy does make available dependency info for some (most? all?) models, there is some documentation here: https://spacy.io/usage/linguistic-features#dependency-parse .

If we can somehow use this info to adjust the difficulty of cards, maybe by treating different dependency structures as 'schemas' that are learnt like vocabulary, hopefully there'd be a marked improvement in the ordering of cards. Possibly the morphologizer can be used productively too.
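To make the "schemas learnt like vocabulary" idea a bit more concrete, here is one possible sketch: reduce each parsed token to a (dependency label, head part-of-speech) pair and count those pairs like lemma frequencies. The triples below are hand-written stand-ins for what spaCy's `token.dep_` and `token.head.pos_` would return, not real parser output:

```python
from collections import Counter

# Hypothetical parse output: (token, dependency label, head part-of-speech).
# In spaCy these would come from token.text, token.dep_, token.head.pos_.
parsed_sentences = [
    [("本当", "obl", "VERB"), ("に", "case", "NOUN"), ("走る", "ROOT", "VERB")],
    [("別", "obl", "VERB"), ("に", "case", "NOUN"), ("回る", "ROOT", "VERB")],
]

def schema_frequencies(sentences):
    """Count (dep label, head POS) pairs, treating each as a learnable 'schema'."""
    counts = Counter()
    for sent in sentences:
        for _token, dep, head_pos in sent:
            counts[(dep, head_pos)] += 1
    return counts

freq = schema_frequencies(parsed_sentences)
print(freq[("case", "NOUN")])  # the Noun + に adverbializer pattern, seen twice: 2
```

A card's difficulty could then be adjusted by how many of its schemas are still low-frequency for the learner, in the same way unknown lemmas are counted today.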

Equally, of course, I think it's clear that this is a case of chasing diminishing returns: ankimorphs + spacy works pretty damn well as is. And as @mortii noted, spacy hasn't proven to be a panacea. And, just as spacy hasn't been a panacea, it's unlikely including some kind of grammar-based adjustments will be.

I'll probably toy around with implementing this idea on my end, but I cannot say when I'd have a proof-of-concept. This is mostly just brainstorming on my part. :)


I'd like to wish a happy new year to all of you!

mortii commented 8 months ago

> Maybe we should start a discussion thread for thoughts on spaCy?

@ashprice Absolutely! spaCy thoughts: #115

mortii commented 8 months ago

Updated the guide: https://mortii.github.io/anki-morphs/user_guide/installation/installing-spacy.html

Thanks for the feedback!

github-actions[bot] commented 6 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.