mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Lemmatizing with Mallet #203

Open Glorifier85 opened 3 years ago

Glorifier85 commented 3 years ago

Hi there,

First off: great solution that has helped me a lot in the past. I am currently preparing to do topic modeling via Mallet and have finished pulling the raw datasets. Before I import and start modeling, I need to take some steps to clean and streamline the texts. What I am a little fuzzy about is stemming and lemmatizing: not the concepts themselves, but rather what the best approach would be.

To be specific, here is what I need to do:

- standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
- remove extra whitespace, e.g. two spaces in a row
- stem and lemmatize

I realize that this is not exactly an issue with Mallet, but I was hoping that someone could, based on experience, recommend an approach for tackling this.

Many thanks in advance!

sdedeo commented 3 years ago

Perhaps you might look at https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00099/43370/Comparing-Apples-to-Apple-The-Effects-of-Stemmers which not only has David on the author list, but also just has a terrific title!

My personal experience tends to confirm the broad conclusions there. I can’t stop my collaborators from stemming, though!

I would suggest that standardizing spelling may be more trouble than it’s worth. In the worst case, where one document category spells something differently (e.g., colour), I’ve found there are enough contextual clues for the model to realize color == colour and put them in the same topic.
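If a handful of known variants still needs to be standardized before import, a small lookup-based pass is usually enough. Here is a minimal sketch in Python, assuming a hand-maintained mapping; the entries and the function name `normalize_spelling` are illustrative and not part of Mallet:

```python
import re

# Hypothetical, hand-maintained mapping from variant to canonical form.
# Keys must be lowercase because matches are lowercased before lookup.
VARIANTS = {
    "topicmodeling": "topic modeling",
    "colour": "color",
}

# One alternation over all variants, matched as whole words, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in VARIANTS) + r")\b",
    re.IGNORECASE,
)

def normalize_spelling(text):
    """Replace each known variant with its canonical (lowercase) form."""
    return _PATTERN.sub(lambda m: VARIANTS[m.group(1).lower()], text)
```

Because of the word boundaries, only exact variants are rewritten; an inflected form such as "colours" is left alone unless it is added to the mapping.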

Finally, extra white spaces won’t affect Mallet output (under the standard options).
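As I understand it, this is because tokens are picked out of the text with a regular expression rather than by splitting on single spaces, so runs of whitespace never become tokens. A rough Python illustration of the idea follows; the letters-only pattern is a stand-in I am assuming for the example, not Mallet's actual default token regex, and both function names are hypothetical:

```python
import re

# Stand-in token pattern: runs of Unicode letters (no digits, no underscore).
# This is an assumption for illustration, not Mallet's own default.
TOKEN = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Return lowercase letter-only tokens; whitespace is never a token."""
    return TOKEN.findall(text.lower())

def collapse_whitespace(text):
    """Optional cleanup: squeeze any run of whitespace to a single space."""
    return " ".join(text.split())
```

With a tokenizer like this, `tokenize("topic  modeling")` and `tokenize("topic modeling")` give the same tokens, so `collapse_whitespace` only matters for keeping the raw files tidy.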

Simon DeDeo Carnegie Mellon University & the Santa Fe Institute http://santafe.edu/~simon

On Jun 29, 2021, at 11:34 AM, Glorifier85 wrote:

Hi there,

First off: great solution that has helped me a lot in the past. I am currently preparing to do topic modeling via Mallet and have finished pulling the raw datasets. Before I import and start modeling, I need to take some steps to clean and streamline the texts. What I am a little fuzzy about is stemming and lemmatizing: not the concepts themselves, but rather what the best approach would be.

To be specific, here is what I need to do:

- standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
- remove extra whitespace, e.g. two spaces in a row
- stem and lemmatize

I realize that this is not exactly an issue with Mallet, but I was hoping that someone could, based on experience, recommend an approach for tackling this.

Many thanks in advance!
