Colab notebook crashing while calculating PCA/K-Means. CSV file contains 80,000+ rows!

jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.

https://texthero.org

MIT License


Colab notebook crashing while calculating PCA/K-Means. CSV file contains 80,000+ rows! #73

Open vidyap-xgboost opened 4 years ago

vidyap-xgboost commented 4 years ago

Hello,

I'm trying to visualize K-means clusters for my dataset, which has 80K+ rows and 9 columns.

The notebook keeps crashing whenever I try to run this particular code:

import texthero as hero

# Add PCA values to the dataframe to use as visualization coordinates
df1['pca'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)

# Add k-means cluster labels to the dataframe
df1['kmeans'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf)
    .pipe(hero.kmeans)
)
df1.head()

Is it because Texthero can't handle that many rows yet? Is there another solution?

cmhashim commented 4 years ago

The same happened to me, and I assumed it was because of "large data and Colab".

jbesomi commented 4 years ago

Hi!

This is a known limitation of the current version of Texthero. It will be fixed in one of the next releases (Texthero is still in beta).

The problem arises in the tfidf step. By default max_features is None, meaning a giant document-term matrix is created, with one column for every distinct term in the corpus. This matrix is sparse by default, but as of now Texthero converts it into a dense matrix (so that it can be saved as a Pandas Series of lists and passed into pca).
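To get a feel for the scale, here is a rough back-of-the-envelope estimate. The vocabulary size of 50,000 terms is a made-up figure for illustration (the real number depends on the tweets), and the calculation assumes 8-byte floats in a contiguous array; a Series of Python lists typically needs even more memory than this.

```python
# Rough memory estimate for a dense document-term matrix.
# ASSUMPTION: vocab_size is a hypothetical value, not taken from the dataset.
n_docs = 80_000        # rows in the CSV (from the issue)
vocab_size = 50_000    # hypothetical number of distinct terms
capped_features = 300  # the max_features value suggested in this thread


def dense_gib(n_rows: int, n_cols: int, itemsize: int = 8) -> float:
    """Size in GiB of an n_rows x n_cols array of itemsize-byte values."""
    return n_rows * n_cols * itemsize / 2**30


full_gib = dense_gib(n_docs, vocab_size)
capped_gib = dense_gib(n_docs, capped_features)

print(f"uncapped vocabulary: {full_gib:.1f} GiB")
print(f"max_features=300:    {capped_gib:.2f} GiB")
```

Under these assumptions the uncapped matrix needs tens of GiB, well beyond what a standard Colab runtime offers, while the capped one fits comfortably.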

For now, you should be able to solve the problem by replacing ".pipe(hero.tfidf)" with ".pipe(hero.tfidf, max_features=300)" (any value between 100 and 300 is ok).
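As an illustration of what capping the vocabulary does, here is a minimal sketch of a similar pipeline written directly against scikit-learn. This is not Texthero's own code: the toy DataFrame, the cluster count, and the choice to compute TF-IDF only once are all assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in for df1['clean_tweet']; the real data has 80K+ rows.
df1 = pd.DataFrame({"clean_tweet": [
    "colab notebook crashed again",
    "notebook crashed while running pca",
    "kmeans clustering on tweets",
    "clustering tweets with kmeans",
    "tfidf creates a sparse matrix",
    "sparse matrix saves memory",
]})

# Cap the vocabulary so the document-term matrix has at most 300 columns.
vectorizer = TfidfVectorizer(max_features=300)
tfidf = vectorizer.fit_transform(df1["clean_tweet"])  # SciPy sparse matrix

# Densify only the capped matrix before handing it to PCA.
coords = PCA(n_components=2).fit_transform(tfidf.toarray())

# KMeans accepts the sparse matrix directly.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

df1["pca"] = list(coords)   # one 2-D coordinate per row
df1["kmeans"] = labels      # one cluster label per row
print(df1.head())
```

Computing the TF-IDF matrix once and reusing it for both PCA and k-means also avoids building the same matrix twice, which the original snippet does.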

Let me know if that works. In future releases we will develop a different solution that returns a sparse Pandas Series, see #43.

vidyap-xgboost commented 4 years ago

[UPDATE]

I tried setting max_features=300 and it worked for 80k+ tweets!

This is a workaround for now.