lmullen closed this issue 1 year ago
I can't figure it out: is it based on WordNet?
@dselivanov It looks like gensim provides the lemmatizing function via the Python pattern package. I see some references to WordNet in that package, but it appears to use a rule-based function:
WordNet seems like the solution I would use.
WordNet seems like a good approach to take, but it's pretty substantial in terms of codebase, and somewhat ornery on Windows. There's also LemmaGen, which is written in C++, appears a lot less complainy when it comes to multi-platform installs, and has support for a ton of non-EN languages too.
In both cases I guess I'd worry about how much it'd increase the size of the codebase. Could this work better in a distinct, suggested/recommended package, maybe, with lemmatizers and tokenizers kept separate?
@Ironholds Yes, it might be a good idea to break them out into separate packages. There is a wordnet package already (https://cran.rstudio.com/web/packages/wordnet/) but I haven't looked closely at its functionality, and ideally this would be possible without using Java.
I have no immediate plans to add this functionality here or in a separate package, but this issue is just to remind me to look into this more closely at some point.
Gotcha! Okay.
Not going to make this addition at this time.
gensim has a lemmatizing tokenizer, which, instead of stemming words, converts them to their lemma. For instance, "was," "being," "am" would tokenize to "be."
https://radimrehurek.com/gensim/utils.html
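To make the request concrete, here is a minimal Python sketch of what a lemmatizing tokenizer does. It is not gensim's implementation and not a proposed API for this package: the `LEMMAS` table and the `lemma_tokenize` function are hypothetical names invented for illustration, and a real lemmatizer would rely on WordNet, LemmaGen, or a rule-based engine rather than a tiny hand-written lookup.

```python
import re

# Hypothetical, hand-written lookup table for illustration only.
# A real lemmatizer would use a full lexicon or rule set instead.
LEMMAS = {
    "am": "be",
    "is": "be",
    "are": "be",
    "was": "be",
    "were": "be",
    "being": "be",
    "been": "be",
}


def lemma_tokenize(text, lemmas=LEMMAS):
    """Split text into lowercase word tokens and map each to its lemma.

    Tokens that are missing from the lookup table are returned unchanged.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [lemmas.get(token, token) for token in tokens]


print(lemma_tokenize("I was being what I am"))
# ['i', 'be', 'be', 'what', 'i', 'be']
```

A stemmer, by contrast, only trims suffixes, so it would leave irregular forms such as "was" and "am" as they are; that difference is the motivation for asking about a lemmatizer.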