Topic Modelling for Non-text data

mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

https://mimno.github.io/Mallet/

Topic Modelling for Non-text data #185

Closed shashankg7 closed 4 years ago

shashankg7 commented 4 years ago

Hi,

I am trying to run topic modelling on a user-modelling task. I have a list of user activities, and I am treating each activity (on an item) as a word. The items are encoded as integers.

I am trying to run LDA on this data. I tried the same approach as for running topic modelling on a directory of text files, but LDA produces NaN when I run it.

Is the tool limited to textual input? Any suggestions would be appreciated.

SeaCelo commented 4 years ago

I’m not positive, but I seem to remember that the pre-processing regex removes all digits. You probably can edit the regex though.

mimno commented 4 years ago

Yes, use --token-regex '\d+'. You're probably getting NaN because there are zero recognized tokens, and it's trying to divide by that number.

shashankg7 commented 4 years ago

Thank you for your response.

I added [0-9]* to the regex. That is equivalent to \d+, right?

Also, I am trying to index around 300k documents using this command, but it has been running for ~24 hours and still has not completed.

Any suggestions on this?
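
On the equivalence question above: as regular expressions, [0-9]* and \d+ are not quite the same, because * also allows a zero-length match while + requires at least one digit. The sketch below shows the difference using Python's re module, not MALLET's Java tokenizer, so it only illustrates the regex semantics. The sample document and the tokenize helper are made up for illustration, and how MALLET itself treats empty matches is not something this sketch tests.

```python
import re


def tokenize(pattern, text):
    """Return every match of `pattern` in `text`, in order, including empty ones."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Hypothetical document: integer item IDs separated by spaces.
doc = "12 345 7"

plus_tokens = tokenize(r"\d+", doc)      # one or more digits per token
star_tokens = tokenize(r"[0-9]*", doc)   # zero or more digits per token

print(plus_tokens)
print(star_tokens)

# Dropping the zero-length matches leaves the same tokens as \d+.
print([t for t in star_tokens if t] == plus_tokens)
```

For ASCII integer IDs like these, [0-9]+ would be the closer equivalent of \d+, since neither can produce an empty token.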