jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License
2.88k stars, 239 forks

Implement/support/explain topic modelling #42

Open jbesomi opened 4 years ago

jbesomi commented 4 years ago

Goal: Implement topic modeling in Texthero.

Topic modeling: There are two main approaches to topic modeling: LSA/LSI (latent semantic analysis/indexing) and LDA (latent Dirichlet allocation). This simple tutorial explains how to implement them in Python.

Python implementation: LSA/LSI is basically just TF-IDF + SVD. What's important is to understand how to visualize the topic model and how to return its information from the function.

Documentation: Besides adding the docstring, it's probably useful to write a "getting started" tutorial on how topic modeling works and how to use Texthero's function.

We will probably want to implement both LSI and LDA, possibly in two separate functions.

This issue is a work in progress. Any help is much appreciated!

Devilmoon commented 4 years ago

I'm not sure if topic modeling has already been implemented in TextHero; if it hasn't, you might be interested in leveraging Gensim. I've used it in the past as a novice in topic modeling and it's relatively simple to use. If I remember correctly, it also supports visualizing the results, which seems to be the core of this issue.

Hope this helps!

jbesomi commented 4 years ago

Hey Luca,

No, topic modeling hasn't been implemented in Texthero (with the small h) yet. Gensim is an alternative, but we might not need it if we implement LSA, as this is the same as calling pca somehow, right?
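On the question of LSA versus PCA: the two are closely related but not identical. PCA centers each column before factorizing, while the truncated SVD used for LSA works on the raw matrix. The sketch below checks this with scikit-learn on a small random matrix, which is only a stand-in for a real document-term matrix.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((20, 6))  # stand-in for a dense document-term matrix

# PCA centers each column before computing the decomposition.
pca = PCA(n_components=2, svd_solver="full").fit(X)

# TruncatedSVD does not center; centering manually reproduces PCA.
X_centered = X - X.mean(axis=0)
svd_centered = TruncatedSVD(n_components=2, algorithm="arpack").fit(X_centered)

# On the raw matrix (what LSA does), the result generally differs.
svd_raw = TruncatedSVD(n_components=2, algorithm="arpack").fit(X)

same_when_centered = np.allclose(pca.singular_values_, svd_centered.singular_values_)
same_on_raw = np.allclose(pca.singular_values_, svd_raw.singular_values_)
```

In practice the missing centering step is what lets truncated SVD operate directly on sparse TF-IDF matrices, since centering would make them dense.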

And yes, the visualization and understanding of the models are for sure an important aspect but that's not the core of the issue. The core of the issue is to understand how to correctly implement topic modeling, which algorithm to pick, see if Gensim is strictly necessary, the function signature and output, and so on.

jwabant commented 4 years ago

@jbesomi For LSA or LDA I think scikit-learn is a good option: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html, https://scikit-learn.org/0.16/modules/generated/sklearn.lda.LDA.html (you already use it for vectorization, dimension reduction and clustering operations). I see it as a start, before integrating more advanced methods such as correlated topic models or structural topic models (with or without covariates; to my knowledge, the latter has an open-source implementation only in R). For rendering topics, people who use scikit-learn typically define functions like print_topics, as here: https://github.com/amueller/mglearn/blob/master/mglearn/tools.py, but we could imagine something else.

jbesomi commented 4 years ago

Thank you Julia! Soon, @henrifroese and @mk2510 will work on this. And I agree, it's good to start with LSA and LDA, see how it goes, and eventually introduce more advanced methods.