juanshishido / okcupid

Analyzing online self-presentation
MIT License

determining `k` #3

Closed · juanshishido closed this issue 8 years ago

juanshishido commented 8 years ago

We talked about the within-cluster sum of squares as a metric, which we will try.

Another option is the silhouette coefficient.

In general, intrinsic methods[, such as the silhouette coefficient,] evaluate a clustering by examining how well the clusters are separated and how compact the clusters are.

(Han, Kamber, Pei)

Another potentially useful link: Selecting the number of clusters.

Both of these approaches concentrate on the clusters themselves. However, we can focus on "downstream" outcomes and, instead, choose k based on how distinct the resulting clusters are along the dimension (e.g., ethnicity) we're interested in investigating. This still requires a metric.
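
One possibility, purely as a sketch: measure how strongly the cluster assignments associate with the dimension of interest, e.g., with normalized mutual information (the label arrays below are placeholders, not our data):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Placeholder labels -- in practice, each user's reported ethnicity
# and their assigned cluster
ethnicity_labels = np.random.randint(0, 5, size=1000)
cluster_labels = np.random.randint(0, 5, size=1000)

# 0 = no association, 1 = clusters align perfectly with ethnicity
print(normalized_mutual_info_score(ethnicity_labels, cluster_labels))
```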

matarhaller commented 8 years ago

Interesting! It seems like the silhouette coefficient takes both the within and between cluster distance into account, so it's probably a better metric than what I proposed this morning.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b).
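
For what it's worth, scikit-learn exposes both the per-sample coefficients and their mean. A minimal sketch (random stand-in data, not our actual feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.rand(500, 10)                 # stand-in for the real features
labels = KMeans(n_clusters=5).fit_predict(X)

s = silhouette_samples(X, labels)           # per-sample (b - a) / max(a, b)
print(s.mean(), silhouette_score(X, labels))  # the score is just the mean
```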

On the one hand, I like the idea of choosing k based on downstream measures (i.e., how well the clusters map to ethnicity), but don't you think that could also be construed as circular? We would be building our model (in this case, the clustering) based on what we want the clusters to be in advance. I guess what I'm wondering is how valid our interpretation of different ethnicities writing differently would be if we introduced the divisions based on that assumption to begin with. I'm not sure if I'm making sense, but it seems a little weird to me.

juanshishido commented 8 years ago

Thanks for bringing this up! It's a great point and worth discussing. And, yes, you are making sense.

I think it could go either way, but I'd lean toward not thinking of it as circular because we're only interested in choosing the number of clusters. If we were choosing the features for clustering on, on the other hand, then I'd be more hesitant. Still, I do see your point and I appreciate the skepticism!

@jnaras what do you think?

I'm perfectly fine with us using the silhouette coefficient.

jnaras commented 8 years ago

Sorry I've been so MIA. Finally finished up some other things!

I'm more inclined to let the data speak for itself in terms of deciding k. It sounds a little circular to me. But it's not too bad because it's all pretty arbitrary.

juanshishido commented 8 years ago

Thanks, @jnaras! Let's stick with the silhouette coefficient for now.

juanshishido commented 8 years ago

Nice code in cd03df2d15c480850398a6db17e7520afb69365b for this. We can modify it slightly to choose the k with the highest silhouette score. Once that's done, we can close this.
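
In other words, something like this (where `scores` is a hypothetical dict mapping each candidate k to its silhouette score):

```python
# `scores` maps each candidate k to its silhouette score (placeholder values)
scores = {2: 0.41, 3: 0.38, 4: 0.45}
best_k = max(scores, key=scores.get)   # the k with the highest score
```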

jnaras commented 8 years ago

That makes a lot of sense :)

matarhaller commented 8 years ago

Since it seemed that the score varied a lot between runs for the same number of clusters (it's pretty sensitive to the initialization parameters), we might want to cross-validate to figure out a reasonable number of clusters. Thoughts?

jnaras commented 8 years ago

Yes. Let's run it 10 times for each k and take an average. That sounds reasonable.

juanshishido commented 8 years ago

^ Yes!

matarhaller commented 8 years ago

Okay, I can take care of that, hopefully tonight.

matarhaller commented 8 years ago

Our data is big, so we need to sample the data when calculating the silhouette score. This means that on each run we get a different score, even for the same clusters. (Note that sometimes calculating the score fails.)

I coded it up so that for every cluster size (i.e., number of clusters), it calculates the clusters 10 times. For each of those clusterings, it calculates a score 5 times and averages those 5 scores. It then takes the best clusterer out of those 10 runs. I end up with a dictionary of clusterers and scores for each cluster size. I'm not sure if that's confusing or the best way of doing it. I'll push the code once I make sure it actually runs.

juanshishido commented 8 years ago

That makes sense, if I'm understanding it correctly. Let's see if I am.

We calculate the clusters 10 times because of the differences in the initialization. For cluster size k=2 at time t=1, the users are put into one of two clusters. The silhouette score is calculated 5 times with sampling and the values are averaged together. The same is done for t=2, ..., 10. This results in 10 averaged silhouette scores for k=2. The best of those 10 is assigned to k=2, which is stored in a dictionary. We do the same for k=3, ..., n.

We then select the k with the best averaged silhouette score.

Did I get close?
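
In code, that procedure might look roughly like this (a sketch only; the feature matrix, range of k, and sample size are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(2000, 20)   # stand-in for the real feature matrix
results = {}                   # k -> (best clusterer, its averaged score)

for k in range(2, 11):
    best_km, best_score = None, -1.0
    for _ in range(10):        # 10 clusterings per k, different initializations
        km = KMeans(n_clusters=k, n_init=1).fit(X)
        # 5 sampled silhouette scores, averaged (scoring the full data is too slow)
        score = np.mean([silhouette_score(X, km.labels_, sample_size=1000)
                         for _ in range(5)])
        if score > best_score:
            best_km, best_score = km, score
    results[k] = (best_km, best_score)

best_k = max(results, key=lambda k: results[k][1])
```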

matarhaller commented 8 years ago

Yes, exactly. At least that's what I hope my code is doing :)

matarhaller commented 8 years ago

It looks like there is a problem with kmeans - all the labels are 0 except for one instance each of 1, 2, 3, and 4. This is for a silhouette score of about 0.97. The thing is that all of the scores are that high.

So I'm not sure if we should:

  1. Rethink the silhouette score as a metric
  2. Rethink kmeans
  3. Something else.

As is, the labels aren't useful. Ideas?
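
(To make the degeneracy concrete, the label distribution looks something like this:)

```python
import numpy as np

# Reconstruction of the degenerate labeling described above, for illustration
labels = np.array([0] * 57814 + [1, 2, 3, 4])
print(np.bincount(labels))   # [57814     1     1     1     1]
```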

jnaras commented 8 years ago

Oh dear. Maybe the k-means initialization is off? It sounds like one cluster is put in the middle of all the data points, and the other four are placed on outliers?

jnaras commented 8 years ago

We could also use the within-cluster variance instead of the silhouette coefficient to determine how good the clusters are. But at the moment it sounds like a k-means init problem.

matarhaller commented 8 years ago

Yes, that sounds like it might be the issue. But then you could imagine that with more clusters it would break up the big cluster while still accounting for the outliers. I'm not quite sure what to do about it except maybe trying even more clusters (right now the optimal number is 5, although the difference in the score is negligible). Alternatively, we can try a different clustering method or a different distance metric.

jnaras commented 8 years ago

Not necessarily; the silhouette seems to just account for the distance between clusters. If there are a few outlying clusters farther away from all the other points, it makes sense that a higher score might be given to fewer clusters.

We could instead use the variance within clusters as a metric?

Yes, are we using Euclidean distance as the distance metric?

matarhaller commented 8 years ago

Yes, it's Euclidean. Should we try something different?

jnaras commented 8 years ago

I think Euclidean makes the most sense logically. I think most people use Euclidean distance as well, otherwise I'd be even more at a loss as to how to cluster the vectors.

Mmm. Okay. Let's try using 'random' instead of 'k-means++', run it 10 times for each k, and then average the silhouette scores?

matarhaller commented 8 years ago

Right now it's running 50 times for each k (using the n_init argument). I can try running it manually 10 times (in a loop), but that's slower and I think we can probably rely on n_init. But I'll try using 'random' instead of 'k-means++'.
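
That is, roughly:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 20)   # stand-in for the real feature matrix

# n_init re-runs the initialization internally and keeps the run with the
# lowest inertia, so a manual loop shouldn't be necessary
km = KMeans(n_clusters=5, init='random', n_init=50).fit(X)
```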

jnaras commented 8 years ago

Yes, that sounds good. I forgot about the n_init argument :) Hopefully that introduces enough variation to change up the clusters a bit.

matarhaller commented 8 years ago

With the random initialization, the silhouette scores are much lower. Now 3 clusters are the best, but the labels are still weird. There are 9,230 in one cluster, 48,588 in another, and none in the third.

jnaras commented 8 years ago

Hm. Weird. I'll think about this a little more then. Thanks for trying it out!

matarhaller commented 8 years ago

I might try other clustering methods too

matarhaller commented 8 years ago

Good call @juanshishido to look at what was different about the instances that were clustered separately. Those people only used 1 word across all their essays...

When I filtered out people who used <100 words, then ran PCA and used KMeans, the silhouette scores dropped dramatically. They went from about 0.97 to <0.1. So it looks like tackling this from new directions is definitely a good idea!
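
The filter itself is simple; roughly (the column name is an assumption, not necessarily what's in our repo):

```python
import pandas as pd

# Toy frame standing in for the profiles data
df = pd.DataFrame({'essays': ['word ' * 150, 'hi there', 'word ' * 99]})

word_counts = df['essays'].str.split().str.len()
df = df[word_counts >= 100]    # keep users with at least 100 words total
```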

jnaras commented 8 years ago

Wow. Should we get rid of those?

matarhaller commented 8 years ago

Sorry, I edited my comment, probably after you replied. I dropped any row that had a total essay length of <100 words and the scores dropped significantly. So I think we should try new things...

matarhaller commented 8 years ago

KModes is really, really slow. For n_clusters ranging from 3 to 6, the costs are really high:

- 3: 7,021,455
- 4: 7,004,895
- 5: 7,009,490
- 6: 6,994,309

To get these values, it had to run for >1 hr. I might run it overnight with 1-10 clusters and see where it gets us, but my feeling is that we might want to put this aside for now.
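
For reference, a sketch of what that loop could look like with the kmodes package (stand-in data, assumed parameters):

```python
import numpy as np
from kmodes.kmodes import KModes

# Stand-in categorical data; the real input would be our categorical features
X_cat = np.random.randint(0, 10, size=(1000, 8))

for k in range(3, 7):
    km = KModes(n_clusters=k, init='Huang', n_init=5)
    km.fit_predict(X_cat)
    print(k, km.cost_)    # cost = total dissimilarity to the cluster modes
```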

jnaras commented 8 years ago

So there's no bend (elbow) to these costs either, then. Hm. But the clusters are less uneven?

matarhaller commented 8 years ago

Yes, definitely less uneven. But still not sure what to make of it.

[screenshot of the cluster sizes, 2015-12-03]

jnaras commented 8 years ago

Okay, let's just leave it at that then. Thanks for trying it out though! It sounded really annoying.

If you would like, you could try running it over night for the other values of k, but if you don't get to it that's totally fine. You've already done so much!

matarhaller commented 8 years ago

Very annoying - but I'll live ;)

I was going to try to run it overnight on our cluster, but since I don't have root access I can't install the kmodes package. I'll see if I can figure out how to do it, but I'm not making it a huge priority since this doesn't seem like it will lead to anything useful anyway.