mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0

spaCy transformer models #111

Closed. mortii closed this issue 8 months ago.

mortii commented 8 months ago

Not sure if this belongs here or if I should make a new issue, but in terms of documentation: the docs currently direct the user to pick models via spaCy's little interactive box on their language models page.

I would recommend adding a warning there that for some languages this box will suggest transformer models (ending in _trf) if you request a higher-accuracy model, which will likely require additional dependencies (namely the spacy-transformers package). I couldn't get these to work with AnkiMorphs even after installing those dependencies!

The large models (ending in _lg) are what I would personally default to, where available. For some languages the UI recommends these as the 'more accurate' model; for others it recommends a _trf model instead. Almost all, if not all, languages with a _trf model also have an _lg model, but you have to actually navigate to the language's packages page to see that.
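For reference, grabbing and loading an _lg model is the usual spaCy one-liner once you know it exists (a minimal sketch; the Danish model is just an illustrative choice):

```python
# Download once via the spaCy CLI:
#   python -m spacy download da_core_news_lg
import spacy

nlp = spacy.load("da_core_news_lg")
print([(token.text, token.lemma_) for token in nlp("Hunden løb gennem parken.")])
```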

Edit: a note: I would guess those dependencies didn't install properly, given that the install pulls in GPU CUDA packages and I have an AMD card. This kind of thing is often a headache on my distro, and I am not using a virtualenv for my spaCy + Anki setup, so the blame is on me there.

Originally posted by @ashprice in https://github.com/mortii/anki-morphs/issues/110#issuecomment-1872573615

mortii commented 8 months ago

I haven't tried using transformer models; if I recall correctly you have to load them differently. I'll test whether I can make it work somehow.

mortii commented 8 months ago

Thankfully the runtime side isn't all that different, but the installation is brutal: https://spacy.io/usage/embeddings-transformers#transformers-installation

I don't love the 5 GB download, but what I especially dislike is having to set the CUDA path; doing that on Windows is a massive headache.
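For reference, the documented route looks roughly like this (a sketch of the steps on that page; the CUDA suffix and path below are illustrative assumptions, not exact values):

```python
# Install sketch, following the spaCy transformers docs:
#   pip install -U "spacy[transformers]"          # CPU-only
#   pip install -U "spacy[transformers,cuda113]"  # GPU; suffix must match your CUDA version
# On Windows the toolkit location also has to be exposed, e.g.:
#   set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3
#
# Once the install succeeds, a *_trf pipeline loads through the same API
# as the CPU-optimized models:
import spacy

nlp = spacy.load("da_core_news_trf")  # example transformer pipeline
print([(t.text, t.lemma_) for t in nlp("Hunden løb gennem parken.")])
```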

I'm not sure transformers are worth all of that... Do you have any experience using them, @ashprice? Are they significantly better than the basic models?

ashprice commented 8 months ago

No, I haven't tried them beyond attempting to install them for AnkiMorphs, partly because, as I said, I couldn't get the dependencies to work through my OS's package manager; things would probably be smoother in a virtualenv.

It's possible that their advantage over the other models isn't even in, say, lemmatization (which is probably what matters most for us) but in POS tagging, for example. And maybe this differs between languages; I haven't looked through the benchmarks systematically. On the page for each language you can see benchmarks for each of the models. Here are the values for da_core_news_trf:

| Metric | Description | Value |
| --- | --- | --- |
| TOKEN_ACC | Tokenization | 1.00 |
| TOKEN_P | | 1.00 |
| TOKEN_R | | 1.00 |
| TOKEN_F | | 1.00 |
| POS_ACC | Part-of-speech tags (coarse grained tags, Token.pos) | 0.99 |
| MORPH_ACC | Morphological analysis | 0.98 |
| MORPH_MICRO_P | | 0.99 |
| MORPH_MICRO_R | | 0.99 |
| MORPH_MICRO_F | | 0.99 |
| SENTS_P | Sentence segmentation (precision) | 0.93 |
| SENTS_R | Sentence segmentation (recall) | 0.93 |
| SENTS_F | Sentence segmentation (F-score) | 0.93 |
| DEP_UAS | Unlabeled dependencies | 0.90 |
| DEP_LAS | Labeled dependencies | 0.87 |
| LEMMA_ACC | Lemmatization | 0.96 |
| TAG_ACC | Part-of-speech tags (fine grained tags, Token.tag) | 0.99 |
| ENTS_P | Named entities (precision) | 0.89 |
| ENTS_R | Named entities (recall) | 0.91 |
| ENTS_F | Named entities (F-score) | 0.90 |

Compared to da_core_news_lg:

| Metric | Description | Value |
| --- | --- | --- |
| TOKEN_ACC | Tokenization | 1.00 |
| TOKEN_P | | 1.00 |
| TOKEN_R | | 1.00 |
| TOKEN_F | | 1.00 |
| POS_ACC | Part-of-speech tags (coarse grained tags, Token.pos) | 0.97 |
| MORPH_ACC | Morphological analysis | 0.96 |
| MORPH_MICRO_P | | 0.97 |
| MORPH_MICRO_R | | 0.97 |
| MORPH_MICRO_F | | 0.97 |
| SENTS_P | Sentence segmentation (precision) | 0.89 |
| SENTS_R | Sentence segmentation (recall) | 0.88 |
| SENTS_F | Sentence segmentation (F-score) | 0.89 |
| DEP_UAS | Unlabeled dependencies | 0.82 |
| DEP_LAS | Labeled dependencies | 0.78 |
| LEMMA_ACC | Lemmatization | 0.95 |
| TAG_ACC | Part-of-speech tags (fine grained tags, Token.tag) | 0.97 |
| ENTS_P | Named entities (precision) | 0.80 |
| ENTS_R | Named entities (recall) | 0.82 |
| ENTS_F | Named entities (F-score) | 0.81 |

A lot of these are quite similar, and some seem quite a bit different. But how much value will that really impart in practice? I can't say.
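A quick way to get a feel for it would be to run both pipelines over the same text and diff the lemmas, something like this (a rough sketch; assumes both models are downloaded, and leans on the fact that their tokenization is near-identical per the benchmarks above):

```python
# Rough comparison: flag tokens where the two Danish pipelines disagree on the lemma.
import spacy

nlp_lg = spacy.load("da_core_news_lg")
nlp_trf = spacy.load("da_core_news_trf")

text = "Hunden løb gennem parken og gemte sig bag træerne."
# Zipping token-by-token is fine for a rough check since both models
# tokenize essentially identically (TOKEN_ACC 1.00 above).
for tok_lg, tok_trf in zip(nlp_lg(text), nlp_trf(text)):
    marker = "" if tok_lg.lemma_ == tok_trf.lemma_ else "  <-- differs"
    print(f"{tok_lg.text:12} {tok_lg.lemma_:12} {tok_trf.lemma_:12}{marker}")
```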

I am sceptical that this is really all that important for our purposes, versus just using the current CPU-optimized models and waiting for better CPU-optimized models in the future. The only reason I mentioned it originally is the headache it caused me when I unsuspectingly went for a _trf model as recommended by the UI, only to discover that the _lg model works well enough and took all of a few seconds to set up (excluding download time). It's just not obvious that the _lg models exist in addition to the _trf models unless you navigate to the page for the individual language.

mortii commented 8 months ago

@ashprice great stuff, I completely agree. I'll update the guide with a short explanation of the model types and a note that transformers are not supported. Thank you!

github-actions[bot] commented 6 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.