tmoravec / thematic

Making sense of large and noisy FB pages
https://tadeas.github.io/thematic/

process.py not working on my Windows PC #1

Closed · fiuderazes closed this 7 years ago

fiuderazes commented 7 years ago

Hello, I'm having some problems running your nice Python files, and I hope you can help me with this "k" error:

```
c:\Python36\osint\topic-analyst>py -3.6 process.py islamiromania.pkl
C:\Python36\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Slow version of gensim.models.doc2vec is being used
Sun Mar 19 12:00:00 2017 Loading data.
Sun Mar 19 12:00:02 2017 Starting to vectorize.
Sun Mar 19 12:00:07 2017 min_df: 3
Sun Mar 19 12:00:07 2017 Tfidf ignores 21030 terms.
Sun Mar 19 12:00:07 2017 Tfidf matrix shape: (79, 212)
Sun Mar 19 12:00:07 2017 Generating 200 LSA components and normalizing
Traceback (most recent call last):
  File "process.py", line 492, in <module>
    main()
  File "process.py", line 487, in main
    text_clustering(raw_data, pagename)
  File "process.py", line 402, in text_clustering
    X = get_features_lsa(tf)
  File "process.py", line 349, in get_features_lsa
    X = lsa.fit_transform(tf)
  File "C:\Python36\lib\site-packages\sklearn\pipeline.py", line 301, in fit_transform
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Python36\lib\site-packages\sklearn\pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "C:\Python36\lib\site-packages\sklearn\decomposition\truncated_svd.py", line 159, in fit_transform
    U, Sigma, VT = svds(X, k=self.n_components, tol=self.tol)
  File "C:\Python36\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py", line 1714, in svds
    raise ValueError("k must be between 1 and min(A.shape), k=%d" % k)
ValueError: k must be between 1 and min(A.shape), k=200
```

Thank you!

tmoravec commented 7 years ago

Hi, the problem is that the page you are analyzing doesn't contain enough data. There are too few posts, and they are too similar.

You can see the log line "Tfidf matrix shape: (79, 212)". In scikit-learn that shape is (documents, terms), so there are 79 posts with text (after some cleaning up) and 212 distinct words that appear often enough to do statistics on. The tool then performs Latent Semantic Analysis, which is configured to extract 200 components, and the SVD it runs requires the number of components to be smaller than both matrix dimensions; that fails with only 79 posts. You could lower this number, but you wouldn't get any meaningful results anyway.
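For illustration, here is a minimal sketch of that constraint and a defensive clamp. This is not the actual process.py code; it just assumes a scikit-learn TF-IDF + TruncatedSVD pipeline like the one visible in the traceback, with toy data standing in for the Facebook posts:

```python
# Minimal sketch (assumed pipeline, not the real process.py code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the scraped posts.
posts = ["some short post text", "another short post text"] * 40

# TfidfVectorizer output has shape (n_posts, n_terms).
tf = TfidfVectorizer(min_df=3).fit_transform(posts)
print(tf.shape)

# scipy's svds (the arpack path in the traceback) requires
# 1 <= k < min(A.shape), so asking for 200 components from a
# (79, 212) matrix raises the "k must be between 1 and min(A.shape)"
# ValueError. Clamping the component count avoids the crash:
n_components = min(200, min(tf.shape) - 1)
lsa = TruncatedSVD(n_components=n_components, algorithm="arpack")
X = lsa.fit_transform(tf)
print(X.shape)  # (n_posts, n_components)
```

The clamp only prevents the exception; as noted above, with so few posts the resulting components wouldn't carry much meaning.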

You know, I created this tool to analyze one particular page for a school project in marketing :-) . It works on "big" pages that have thousands of posts. I don't think I could "fix" it to work with smaller pages because, as with pretty much all machine learning, even the smartest algorithms fail when there isn't enough data.

fiuderazes commented 7 years ago

You are so right. Thank you for the kind answer and for this nice tool :)