jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License

-1 dbscan category #199

Open foongminwong opened 3 years ago

foongminwong commented 3 years ago

Hi, I was trying to run DBSCAN on some texts and create a scatter plot.

I wonder why my dbscan_labels column has a -1 category (I'm not sure what it means):

documents['dbscan_labels'] = (
    documents['tfidf']
    .pipe(hero.dbscan)
    .astype(str)
)

hero.scatterplot(df=documents, col='pca', color='dbscan_labels', hover_data=['ID', 'Title'], title="DBScan Clustering (Test) - Texthero library")

[image: scatter plot colored by dbscan_labels]

I previously ran k-means on the same data, and the clusters and scatter plot looked good:

documents['tfidf'] = (
    documents['Text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

documents['kmeans_labels'] = (
    documents['tfidf']
    .pipe(hero.kmeans, n_clusters=13)
    .astype(str)
)

documents['pca'] = documents['tfidf'].pipe(hero.pca)

hero.scatterplot(df=documents, col='pca', color='kmeans_labels', hover_data=['ID', 'Title'], title="K-Means Clustering (Test) - Texthero library")

[image: scatter plot colored by kmeans_labels]

Thank you!

jbesomi commented 3 years ago

Hi @foongminwong, thank you for reaching out!

DBSCAN either assigns each point to a cluster or marks it as a noise point (an outlier). The label -1 means that the algorithm classified those points as noise, i.e. they do not belong to any cluster.
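To illustrate where the -1 label comes from, here is a minimal pure-Python sketch of the DBSCAN idea. It is not texthero's implementation: the function name `dbscan`, the `NOISE` constant, the parameter names `eps` and `min_samples`, and the toy points are all assumptions made for this example. A point with fewer than `min_samples` neighbors within distance `eps` that is not reachable from any dense region keeps the label -1.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # hypothetical constant: label for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Illustrative DBSCAN sketch: one label per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (point i itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            # Not dense enough to start a cluster; stays noise unless a
            # cluster reaches it later as a border point.
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand from it again
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, keep growing the cluster
        cluster += 1
    return labels


# Two tight groups of three points each, plus one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (20, 20)]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (20, 20) has no neighbors within `eps`, so it ends up with -1. In a scatter plot colored by label, such points show up as their own "-1" category.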

We need to update the docstring of the texthero.representation.dbscan function to state explicitly that -1 marks noise points. Would you like to help us with that?