vidyap-xgboost opened this issue 4 years ago
The same thing happened to me, and I assumed it was because of the large dataset and Colab.
Hi!
This is a known (current) limitation of Texthero; it will be fixed in an upcoming release (Texthero is still in beta).
The problem arises in the tfidf step: by default max_features is None, so a giant document-term occurrence matrix is created. That matrix is sparse by default, but as of now Texthero converts it into a dense matrix (to store it as a Pandas Series of lists and pass it to pca).
For now, you should be able to solve the problem by replacing ".pipe(hero.tfidf)" with ".pipe(hero.tfidf, max_features=300)" (any value between 100 and 300 is fine).
Let me know if that works. In a future release we will implement a different solution that returns a sparse Pandas Series, see #43.
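For context, here is a minimal sketch of how the workaround fits into a typical Texthero pipeline. The DataFrame, file name, column names, and number of clusters are illustrative assumptions, not taken from the original issue:

```python
import pandas as pd
import texthero as hero

# Illustrative data; replace with your own DataFrame and text column.
df = pd.read_csv("tweets.csv")  # hypothetical file name
df["clean_text"] = hero.clean(df["text"])

# Capping max_features keeps the document-term matrix small enough
# to be densified without exhausting Colab's RAM.
df["tfidf"] = df["clean_text"].pipe(hero.tfidf, max_features=300)

# Project to 2D for plotting and cluster on the tf-idf vectors.
df["pca"] = df["tfidf"].pipe(hero.pca)
df["kmeans"] = df["tfidf"].pipe(hero.kmeans, n_clusters=5).astype(str)

hero.scatterplot(df, "pca", color="kmeans", title="K-means clusters")
```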
[UPDATE]
I tried setting max_features=300 and it worked for 80k+ tweets!
This is a workaround for now.
Hello,
I'm trying to visualize K-means clusters for my dataset, which has 80K+ rows and 9 columns.
The notebook keeps crashing whenever I try to run this particular code:
Is it because Texthero can't handle that many rows yet? Is there any other solution?