ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Lemmatizing tokenizer #5

Closed · lmullen closed this issue 1 year ago

lmullen commented 8 years ago

gensim has a lemmatizing tokenizer, which, instead of stemming words, converts them to their lemma. For instance, "was," "being," and "am" would all tokenize to "be."

https://radimrehurek.com/gensim/utils.html
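For reference, roughly what that API looks like (a sketch against gensim 3.x; `utils.lemmatize` depends on the Pattern library and was later removed in gensim 4.0, so the exact output below is illustrative):

```python
# gensim.utils.lemmatize tokenizes, POS-tags, and lemmatizes in one pass,
# returning byte strings of the form b'lemma/POS'.
# Requires gensim < 4.0 and Pattern (pip install pattern).
from gensim.utils import lemmatize

tokens = lemmatize("The cats were being difficult")
print(tokens)
# roughly: [b'cat/NN', b'be/VB', b'be/VB', b'difficult/JJ']
```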

dselivanov commented 8 years ago

I can't figure out: is it based on WordNet?

lmullen commented 8 years ago

@dselivanov It looks like gensim provides the lemmatizing function via the Python pattern package. I see some references to WordNet in that package, but it appears to use a rule-based function:

https://github.com/clips/pattern/blob/820cccf33c6ac4a4f1564a273137171cfa6ab7cb/pattern/text/en/inflect.py#L645
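Something like this, if I'm reading it right (pattern is Python 2 era, though 3.x ports exist; just a sketch of its rule-based lookup):

```python
# pattern's lemma() uses the rule-based inflect.py linked above;
# no WordNet lookup is involved.
from pattern.en import lemma

for w in ("was", "being", "am"):
    print(w, "->", lemma(w))
# was -> be, being -> be, am -> be
```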

WordNet seems like the solution I would use.
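For comparison, the WordNet route looks like this via NLTK (not the R wordnet package; just to illustrate the dictionary lookup):

```python
# WordNet-based lemmatization via NLTK's wrapper around morphy.
# Requires: pip install nltk; then nltk.download("wordnet") once.
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
for w in ("was", "being", "am"):
    # WordNet needs a part-of-speech hint; "v" = verb.
    print(w, "->", wnl.lemmatize(w, pos="v"))
# was -> be, being -> be, am -> be
```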

Ironholds commented 7 years ago

WordNet seems like a good approach to take, but it's pretty substantial in terms of codebase, and somewhat ornery around Windows. There's also LemmaGen, which is written in C++, appears a lot less complainy when it comes to multi-platform installs, and has support for a ton of non-English languages too.

In both cases I guess I'd worry about how it would increase the size of the codebase. Could this do better in a distinct, suggested/recommended package, maybe? A lemmatizers package alongside tokenizers.

lmullen commented 7 years ago

@Ironholds Yes, it might be a good idea to break them out into separate packages. There is already a wordnet package on CRAN (https://cran.rstudio.com/web/packages/wordnet/), but I haven't looked closely at its functionality, and ideally this would be possible without using Java.

I have no immediate plans to add this functionality here or in a separate package, but this issue is just to remind me to look into this more closely at some point.

Ironholds commented 7 years ago

Gotcha! Okay.

lmullen commented 1 year ago

Not going to make this addition at this time.