stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Lemmatization in Spanish #137

sebastiandev closed this issue 8 years ago

sebastiandev commented 8 years ago

I'm trying to get the lemma of each word in the provided text, but I couldn't find any docs about it. I'm wondering whether the lemma annotator takes any particular parameters, as the other annotators do, since I'm already setting the language for the tokenizer and the Spanish models for pos, parse, and ner.

What I'm getting now looks like the lemmatizer didn't find anything and is just returning the original word:

"caminando" => "caminando" (should be "camina")
"arboles" => "arboles" (should be "arbol")
"corriendo" => "corriendo" (should be "corre")

manning commented 8 years ago

Sorry, but we don't currently have a Spanish lemmatizer. See the list of supported languages: http://stanfordnlp.github.io/CoreNLP/index.html#human-languages-supported

pommedeterresautee commented 8 years ago

@manning is there a way to override the English lemma dictionary with a custom one? I can't find where in the Java code I would override it.

manning commented 7 years ago

Currently I think it is hardwired. It would be sensible to make that something that could be overridden, but it's not just a dictionary. It's a compiled piece of code. (Essentially, a big finite automaton.)
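
Since the built-in lemmatizer can't be swapped out, one possible workaround is to post-process its output with your own lookup table. The following is only a minimal sketch under that assumption: `OverrideLemmatizer` is a hypothetical helper, not part of CoreNLP, and the table entries are illustrative dictionary forms. It takes a word plus whatever lemma the upstream tool produced, and returns the custom lemma when the word is listed, otherwise the upstream lemma unchanged.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical helper (NOT a CoreNLP class): applies a custom lookup table
 * on top of the lemma produced by some upstream lemmatizer.
 */
class OverrideLemmatizer {
    // Custom word -> lemma entries, keyed by lower-cased surface form.
    private final Map<String, String> overrides = new HashMap<>();

    OverrideLemmatizer(Map<String, String> entries) {
        for (Map.Entry<String, String> e : entries.entrySet()) {
            overrides.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
    }

    /**
     * Returns the custom lemma if the word is in the table; otherwise
     * returns the lemma the upstream tool produced.
     */
    String lemma(String word, String upstreamLemma) {
        String custom = overrides.get(word.toLowerCase(Locale.ROOT));
        return custom != null ? custom : upstreamLemma;
    }

    public static void main(String[] args) {
        // Illustrative entries only; a real table would be loaded from a file.
        Map<String, String> table = new HashMap<>();
        table.put("caminando", "caminar");
        table.put("corriendo", "correr");

        OverrideLemmatizer lemmatizer = new OverrideLemmatizer(table);

        // Listed word: the custom lemma wins.
        System.out.println(lemmatizer.lemma("caminando", "caminando"));
        // Unlisted word: the upstream lemma is passed through.
        System.out.println(lemmatizer.lemma("casa", "casa"));
    }
}
```

In a pipeline, you would call `lemma(word, upstreamLemma)` once per token after annotation. This only covers words you list explicitly; it does not generalize to unseen inflections the way a rule-based or automaton-based lemmatizer does.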