shahsohil / DCC

This repository contains the source code and data for reproducing the results of the Deep Continuous Clustering paper
MIT License

Clustering result problem #5

Closed by weishao6hao 5 years ago

weishao6hao commented 6 years ago

Hi, thank you for your work. I applied this algorithm to my own data, but most of the data points are assigned to the first cluster. What causes this? What should I change to improve the result?

scottfleming commented 5 years ago

Same thing on my end, actually: one cluster contains roughly 90% of the data points, and the other ~10% of data points are all in their own cluster. For reference, I'm clustering posts from several different subreddits on Reddit. The sample is ~5000 "documents", each about 3-10 sentences long. My initial featurization is tf-idf.
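To check whether a run shows the imbalance described above, it can help to summarize how the points are spread across clusters. The snippet below is a minimal sketch using only the Python standard library; the `cluster_size_fractions` helper and the example `labels` list are hypothetical and are not part of the DCC code.

```python
from collections import Counter


def cluster_size_fractions(labels):
    """Return {label: fraction of all points}, largest cluster first."""
    counts = Counter(labels)
    total = len(labels)
    # most_common() orders clusters from largest to smallest.
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical assignments: nine points in cluster 0, one point in cluster 1.
labels = [0] * 9 + [1]
fractions = cluster_size_fractions(labels)
for label, fraction in fractions.items():
    print(f"cluster {label}: {fraction:.0%} of points")
```

A single cluster holding the large majority of points, as reported here, shows up as the first entry dominating the summary.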

shahsohil commented 5 years ago

I believe it could be a normalisation issue. Please see #13.
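The thread does not say which normalisation #13 refers to. One common choice for tf-idf features, shown here purely as an assumption and not as the repository's own preprocessing, is to scale every document vector to unit L2 norm so that long and short documents are comparable. The `l2_normalize_rows` helper and the small `features` matrix are hypothetical.

```python
import math


def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row))
        if norm == 0.0:
            # Avoid division by zero for empty documents.
            normalized.append(list(row))
        else:
            normalized.append([value / norm for value in row])
    return normalized


# Hypothetical feature rows with very different magnitudes.
features = [[3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
unit_rows = l2_normalize_rows(features)
print(unit_rows)
```

Whether this particular scaling fixes the single dominant cluster is not confirmed anywhere in this thread.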

ilyak93 commented 3 years ago

> Same thing on my end, actually: one cluster contains roughly 90% of the data points, and the other ~10% of data points are all in their own cluster. For reference, I'm clustering posts from several different subreddits on Reddit. The sample is ~5000 "documents", each about 3-10 sentences long. My initial featurization is tf-idf.

Was it a normalization issue? Did fixing that solve it?