quanteda / spacyr

R wrapper to spaCy NLP
http://spacyr.quanteda.io

Warning message: lemmatization may not work properly #247

Closed by seb-29 8 months ago

seb-29 commented 9 months ago

I parsed a German corpus with the model de_core_news_sm as follows:

spacy_parse(myquantedacorpus, pos=TRUE, tag=TRUE, lemma=TRUE)

spacyr (spaCy Version: 3.7.2) issued the following warning message:

Warning message:
In spacy_parse.character(myquantedacorpus, pos=TRUE, tag=TRUE, lemma=TRUE) : lemmatization may not work properly in model 'de_core_news_sm'

I got the same warning when using the supposedly better model de_core_news_lg. Hence my questions: How should this warning be interpreted? Can I trust the results? Which lemmatizer works well (in R)?

To my knowledge, the best-performing model should be de_dep_news_trf. However, I have had GPU-related problems with it in the past (see issue 215).

kbenoit commented 8 months ago

This is a spaCy feature, not a spacyr issue. But judging from the sources you link above, even the smallest model seems to achieve a lemmatizer accuracy of 0.97. See https://spacy.io/models/de#de_core_news_sm-accuracy.
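
To judge whether the lemmas can be trusted on a particular corpus, one option is to hand-check a small sample and compare it with the lemma column of the parsed output. The following is only a minimal sketch in base R. The data frame `parsed` is a toy stand-in for the result of spacy_parse(), its values are invented for illustration and are not real spaCy output, and `gold` and `lemma_accuracy()` are hypothetical names introduced here.

```r
# Toy stand-in for the data frame returned by spacy_parse(..., lemma = TRUE).
# The values are invented for illustration, NOT real spaCy output.
parsed <- data.frame(
  token = c("Die", "Kinder", "gingen", "in", "die", "Häuser"),
  lemma = c("der", "Kind", "gehen", "in", "der", "Häuser"),
  stringsAsFactors = FALSE
)

# Hand-checked lemmas for the same tokens (hypothetical gold sample).
gold <- c("der", "Kind", "gehen", "in", "der", "Haus")

# Share of tokens whose predicted lemma matches the hand-checked lemma.
lemma_accuracy <- function(predicted, gold) {
  stopifnot(length(predicted) == length(gold))
  mean(predicted == gold)
}

acc <- lemma_accuracy(parsed$lemma, gold)

# Rows where the predicted lemma differs from the hand-checked one.
mismatches <- parsed[parsed$lemma != gold, ]

print(acc)
print(mismatches)
```

With real data, `parsed` would be replaced by the actual spacy_parse() result for the sampled documents, and `gold` by lemmas checked manually in the same token order. A sample of a few hundred tokens is usually enough to see whether the errors are systematic.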