mit-nlp / MITIE

MITIE: library and tools for information extraction

Best way to replace stemmer or have multiple stemmers? #63

Closed by avitale 8 years ago

avitale commented 8 years ago

Hi Davis,

Looking at the code, it seems to me that everything is language-agnostic apart from the English stemmer used in text categorization.

What would be the best way to replace the stemmer with another one or, even better, to have multiple stemmers for different languages?

Thank you very much!

davisking commented 8 years ago

I wouldn't worry about it. The word vectors MITIE uses dynamically generate word morphology features when you train wordrep, so the stemming isn't very important. The main thing is to have a tokenizer that makes sense for your language. You can also perform any reasonable token normalization at that tokenization stage.
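
To illustrate what "token normalization at the tokenization stage" might look like, here is a minimal sketch in plain Python. It is not MITIE code and does not call any MITIE API; the names `normalize_token` and `tokenize_normalized` are hypothetical, and the regex and the choice of NFKC plus case folding are assumptions that you would adapt to your language.

```python
import re
import unicodedata
from typing import List

# Assumption: a token is either a run of word characters or a single
# punctuation mark. Languages without whitespace-delimited words would
# need a real segmenter instead of this regex.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def normalize_token(token: str) -> str:
    """Fold Unicode compatibility forms and case so variants share one form."""
    return unicodedata.normalize("NFKC", token).casefold()


def tokenize_normalized(text: str) -> List[str]:
    """Split text into tokens and normalize each one."""
    return [normalize_token(t) for t in _TOKEN_RE.findall(text)]


if __name__ == "__main__":
    print(tokenize_normalized("Hello, World!"))  # ['hello', ',', 'world', '!']
```

The resulting token list is what you would then hand to whatever training or extraction step you use, so the same normalization has to be applied at both training and prediction time.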

avitale commented 8 years ago

OK, thanks!