Clustering Result

royxue commented 9 years ago

In order to make what we do more accurate, I tried different methods.

1. I used pam to get the silhouette score. pam is "Partitioning (clustering) of the data into k clusters 'around medoids', a more robust version of K-means." It returns a good result with a cluster number of 4.
2. I tried to use the original kmeans to get the result. Even when I make it loop 100 times, the result is random, and not clearly enough to get best answer.
3. For clusGap, first I used the original R function clusGap to get the result. I ALSO tried another method to get clusGap, from https://github.com/echen/gap-statistic. The results from these 2 methods are similar, so I think there is no problem with the clusGap result.

So I think if there are some unavailable situation that clusGap is not accurate, but there is nothing about this in the paper(uploaded as gap.pdf).

What's your opinion?

royxue commented 9 years ago

@julian-ramos

julian-ramos commented 9 years ago

Questions for each of the items above:

In 1, you say that the result is good. What do you mean by that?

In 2, what do you mean by "result is random"? What do you mean by "not clearly enough to get best answer"? This does not make any sense.

In 3, it is hard to say that the results are the same. First of all, they are not on the same scale. Second, you are not drawing the confidence interval or standard deviation, so we cannot really compare them. Also, numerically they are actually different.

Your last sentence: "So I think if there are some unavailable situation that clusGap is not accurate." I don't understand what you are trying to say.

Please take more time to write these reports; they are really hard to understand.

Please do the following:

1. Add standard deviation whiskers or a confidence interval to the kmeans graph.
2. For all of the metrics use the same range for k. So far you tried 2 to 50 for one version of gap, 2 to 30 for the other and 2 to 20 for kmeans.
3. Try two more clustering algorithms: DBScan and hierarchical clustering. For them also run gap and silhouette.
4. Once you are done with all this, go beyond making graphs and reporting numbers. What is your interpretation of the results? What are the centroids discovered by each clustering method for the best K found with gap or silhouette?

You need to dig deeper on the data! Analyze your results!

Thanks,

royxue commented 9 years ago

@julian-ramos I'm so sorry, I didn't write what I wanted to say clearly. I will rewrite it more accurately, explaining what I did and why.

julian-ramos commented 9 years ago

One more thing! Don't take this as a reprimand, because it is not. I just want you to think more about this project and what you are doing.

Best,

royxue commented 9 years ago

@julian-ramos Oh, I knew that, thank you very much.

Yesterday, I thought a lot about what I did, and went back to check the results I got. I think I did make some mistakes, and there are also things I should pay attention to. I have finished all my master's degree applications, and I will concentrate on our projects.

For Question 1 yesterday, about finding the best k by silhouette score: what we want to do is run the clustering algorithm, calculate the silhouette score and draw a graph. Then we can find the best K from the peak (max value) of the graph. For the pam algorithm, we can see the best K is 4. Then I used library(fpc) to run the kmeans algorithm and drew the result graph again. From this graph we can see its best K is also 4.

Next I will recheck the clusGap part.

PS: Is the explanation clear?

julian-ramos commented 9 years ago

Your explanation is clearer; however, what you are doing is not entirely correct.

When you use kmeans (and probably k-medoids too), you know the initialization of the algorithm is random. If you didn't know, please check on Wikipedia how kmeans works. Due to this initialization, and also because the EM-style iteration does not guarantee convergence to the global optimum, the solution you get when you run kmeans is not unique! This is why you have to run kmeans multiple times to confirm that the results you are getting, despite not being the same, are similar and stable. In our specific case we want to find our K, so we have to run kmeans multiple times and check both the maximum average metric score (metric = silhouette or gap) for different k's and the standard deviation, which will tell us if this result is stable. According to your last email, you are still missing the standard deviation measurement. Unless of course the std is zero, though by now you should realize why that is unlikely.

I will be submitting your recommendation letters tonight.

Regards,

Julian.
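
The repeated-runs procedure described in this comment can be sketched as follows. This is only an illustration in plain Python (the thread itself works in R), assuming points are tuples of floats; the helper names `kmeans`, `silhouette` and `silhouette_stats` are made up for this sketch and are not the code used in the project.

```python
import math
import random
import statistics


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, rng, n_iter=100):
    """Plain Lloyd's k-means, started from k randomly chosen data points."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def silhouette_stats(points, ks, n_runs=50, seed=0):
    """Per k: (mean, sample standard deviation) of the silhouette over n_runs restarts."""
    rng = random.Random(seed)
    stats = {}
    for k in ks:
        scores = [silhouette(points, kmeans(points, k, rng)[0]) for _ in range(n_runs)]
        stats[k] = (statistics.mean(scores), statistics.stdev(scores))
    return stats
```

Calling `silhouette_stats(points, ks=range(2, 21))` would give one mean and one standard deviation per k, which is the pair of numbers asked for here. A standard deviation of exactly zero would mean every restart produced the same score.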

royxue commented 9 years ago

@julian-ramos I just uploaded the mean and standard deviation of the kmeans silhouette results (50 runs).

julian-ramos commented 9 years ago

Hi Lijun,

Thanks, that is all good now. However, please combine the mean and standard deviation results in one graph. We need to be able to see whether the standard deviation of each K overlaps with the mean of other K's. The graph that we need should look like gapClus.png, where we have both the means and the standard deviations.

Julian.

royxue commented 9 years ago

@julian-ramos Hi Julian, I think mixing those two graphs into one is a good idea, but as these two values have different y-axis scales (mean from 0.26-0.40, sd from 0.01-0.04), it's hard to draw them in one graph.

But if we just want to use the graph relationship, not the real value relationship, I think I can draw the mean value and the 10*sd value in one graph. Will this be OK?

julian-ramos commented 9 years ago

No, no, no. Look at gapClus.png: the whiskers there are the std and they are centered on the mean. That is what we need to see.

Julian.
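
To make the whisker idea concrete: the whiskers are not a second curve on their own scale, they are the interval from mean - sd to mean + sd drawn around each mean. A small sketch in plain Python, assuming a hypothetical dict `scores_by_k` that maps each k to the list of silhouette scores from its repeated runs (the function names are illustrative, not from the project):

```python
import statistics


def whiskers(scores_by_k):
    """Map each k to (mean, lower, upper), with whiskers at mean -/+ one sample sd."""
    summary = {}
    for k, scores in scores_by_k.items():
        mean = statistics.mean(scores)
        sd = statistics.stdev(scores)
        summary[k] = (mean, mean - sd, mean + sd)
    return summary


def overlapping_ks(summary, k):
    """Other k values whose mean falls inside the whisker range of k."""
    _, lower, upper = summary[k]
    return [other for other, (mean, _, _) in summary.items()
            if other != k and lower <= mean <= upper]
```

An empty list from `overlapping_ks(summary, 4)` would say that no other k has a mean inside the whiskers of k=4. For the plot itself, the same three numbers per k can be handed to any error-bar routine, for example `matplotlib.pyplot.errorbar(ks, means, yerr=sds)` in Python.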

royxue commented 9 years ago

@julian-ramos Oh, I got what you mean!

royxue commented 9 years ago

@julian-ramos I made the boxplot graph. Here you can check the graph at

It shows the following information:

1. The original box plot (http://en.wikipedia.org/wiki/Box_plot):
   - the minimum and maximum of all values (the highest and lowest lines)
   - the 25% - 75% values (box)
   - the median (the band inside the box)
   - strange values, too big or too small (points)
2. The boxplot with mean and standard deviation:
   - the mean (the band inside the box)
   - the mean with standard deviation (box)
   - the minimum and maximum of all values (the highest and lowest lines)

royxue commented 9 years ago

@julian-ramos I uploaded the hierarchical clustering graph and heatmap. For DBScan, I used dbscan from the fpc package, but the result is not what we want. I will spend more time on this to figure out how it works.

I'm getting a fever, so I'm going to sleep early tonight. I will write the interpretation of the results a little bit later, tomorrow morning my time.

julian-ramos commented 9 years ago

I hope you get better soon. I think once you are done with DBScan we can talk about what I want to do next. Though for that I need to know if you want to keep working on this project and for how long.

royxue commented 9 years ago

Yes, I want to continue working on this project. It's very interesting, and there are things I want to learn. Now my master degree applications are all finished, so I could spend more time on it.

BTW, to be honest, sorry for last several weeks. I spent too much time on applications.

royxue commented 9 years ago

Some interpretation of the kmeans results:

From kmeans silhouette: In order to find the best K for applying kmeans to our data, we need to find the K with the highest silhouette score. From the result graph, we can see that when K=4 we get the best silhouette score.

From kmeans_boxplot_mean: We can see that when k=4 we get the best silhouette score; however, the standard deviation is ..., which means the result of k=4 is not stable. Then let's look at kmeans_boxplot_median.png, which shows the bad results (points outside the box): when k=4, there are several cases where we get a very low silhouette score. These cases make the standard deviation value become big.

julian-ramos commented 9 years ago

Hi Roy,

Your interpretation is wrong. The standard deviation in the case of k=4 for kmeans using the means is not too bad. K=1 is better; however, that is not what we are interested in. Also, while the standard deviation in this case has some overlap with the standard deviation of k=3, that overlap is small, and that is what we're after. I suggest you read about what a box plot is and what it means. I also suggest you do not come up with explanations or paraphrase what I wrote earlier.

For the kmeans median case, the results are much better. The points that you see are outliers, so it is still OK, as it still tells us that the right number is 4. However, we have to be careful with which model we pick.

Having a small silhouette score does not mean we will have a big standard deviation. Check the definition and try to understand why.

Last, now we know the number of centroids is 4; however, we also know not all of the kmeans results are good. So run kmeans again for k=4, say 50 times, then pick the clustering that gives you the best silhouette score. Check the centroids and tell me what they are.

Julian.
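
The "run it 50 times and keep the best" step can be written independently of the clustering code. A minimal sketch in plain Python, assuming a hypothetical callable `run_once(rng)` that performs one k-means fit at k=4 and returns a `(silhouette_score, centroids)` pair; neither the name nor the signature comes from the project:

```python
import random


def best_of_runs(run_once, n_runs=50, seed=0):
    """Call run_once(rng) n_runs times and keep the highest-scoring result.

    run_once is a hypothetical callable returning (score, centroids).
    """
    rng = random.Random(seed)
    best_score, best_centroids = None, None
    for _ in range(n_runs):
        score, centroids = run_once(rng)
        if best_score is None or score > best_score:
            best_score, best_centroids = score, centroids
    return best_score, best_centroids
```

Fixing `seed` makes the selection reproducible, which helps when the chosen centroids are inspected afterwards.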

royxue commented 9 years ago

@julian-ramos Hi, Julian,

1. For K=1, actually, to make the graph look nice, I set all values at K=1 to the same value. I didn't calculate the value when K=1; the function doesn't permit K=1.

2. For the box plot, the original box plot shows the median and the 25%-75% box. The other box plot, which shows the mean value with mean ± standard deviation, is one I defined myself (R doesn't support this kind of graph). In this graph, the mean and standard deviation are both calculated from the values we get from 50 runs. So a low silhouette (x) will lead to a high standard deviation (x - mean is very big), but if we ignore these bad cases, the result is not bad.

3. Next, I will do dbscan first, then get the centroid information when K=4.

royxue commented 9 years ago

@julian-ramos Another thing is that I have read the dbscan wiki page.

from wiki: => "Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "Curse of dimensionality", making it difficult to find an appropriate value for ε"

So I think dbscan may not perform well in our case. That's why I asked those questions yesterday.

royxue commented 9 years ago

@julian-ramos I just uploaded the centroid information for K=4. I ran 50x3 times and got three results with the same centroids. You can see the results in kmeans_k4_result.txt.

julian-ramos commented 9 years ago

Hi Roy,

K=1 does not even make sense; it is not helpful and there was no need to include it.

I know what a boxplot is and it seems you do too; however, given that, neither your questions nor your claims make sense. Why do you claim a low silhouette will produce a high standard deviation? Why would you say k=4 for the mean case produces an unstable result? Similarly, you said the standard deviation is low or high; what do you mean by that?

About DBScan: the curse of dimensionality applies for thousands of features. We barely have 20, so no, it does not apply in our case. Your questions were about the code itself, not about DBScan.

Julian.

royxue commented 9 years ago

@julian-ramos Hi, Julian. By low/high standard deviation I mean whether the data points are close to the mean or not. By low silhouette I mean that the bad cases' silhouette scores are far from the mean. By unstable I meant that there are one or two bad cases when we run kmeans 50 times, though it doesn't matter a lot. I think maybe I used the wrong word to describe this.

About DBScan, I will look at how to define E and MinPts to get good answers.

julian-ramos commented 9 years ago

Hi Lijun,

I don't think I understand what you did. Was it the following: you ran kmeans for k=4 50 times, then picked the best centroids, and you did all this 3 times?

Certainly the centroids look similar, but did you confirm they are the same?

Earlier I said we have 20 features, but actually we have 3 times that; I forgot about the morning, afternoon and evening components. Sorry about that, though the curse of dimensionality still does not apply to us.

Julian.
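
One way to confirm that centroid sets from different loops are the same, rather than only similar-looking, is to compare them numerically while ignoring cluster order, since k-means may number the clusters differently on each run. A rough sketch in plain Python, assuming each centroid is a tuple of floats; `same_centroids` and its `tol` parameter are illustrative names:

```python
def same_centroids(a, b, tol=1e-8):
    """True if two centroid sets match up to ordering and a small tolerance.

    Sorting aligns the two sets; this simple alignment can fail when two
    centroids differ only within tol in their leading coordinates.
    """
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= tol
               for ca, cb in zip(sorted(a), sorted(b))
               for x, y in zip(ca, cb))
```

Calling it on every pair of the three results would answer the question above with a yes or no instead of a visual impression.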

royxue commented 9 years ago

@julian-ramos Yes, you are right. One loop is "run kmeans with k=4 50 times, get the best centroids", and I did 3 loops.

The results are the same. I checked the exact value of every centroid in every loop.

julian-ramos commented 9 years ago

Hi Lijun,

I will answer inline in your email:

On Fri, Dec 19, 2014 at 6:42 PM, Roy Xue notifications@github.com wrote:

> @julian-ramos Hi, Julian.
>
> By low/high standard deviation I mean whether the data points are close to the mean or not. By low silhouette I mean that the bad cases' silhouette scores are far from the mean.

Here is where your interpretation is wrong. When we compute the average and standard deviation we are, for a moment, assuming our data is normally distributed. Following that assumption, we are interested in showing that a result for a given clustering is far enough from other results to determine that our result of interest is in fact unique. As a counterexample, when the standard deviation completely overlaps that of another result, this is basically telling us that our results could come from either clustering.

> By unstable I meant that there are one or two bad cases when we run kmeans 50 times, though it doesn't matter a lot. I think maybe I used the wrong word to describe this.

This is why I tell you that you have to be precise and careful with your use of words. Nonetheless, the results are fairly stable in most cases.

> About DBScan, I will look at how to define E and MinPts to get good answers.

This is almost the right question. Here is the right way to ask about it: is there any method to calculate the best values for E and MinPts for DBScan?

Now about the answer: I do not really know what the best parameters are. However, we can make some educated guesses and then use silhouette to determine the best set of parameters! For instance, for E (epsilon) we could look at the minimum distance between any two data points in our data set and make epsilon a bit higher than that value. For MinPts, we could start from say 2 and then try 4, 6, 8 and

Then you could plot the averaged silhouette scores and stds and from that graph determine the best parameters for DBScan.

Julian.
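
The educated-guess search over E and MinPts suggested in this reply could start from a candidate grid like the one below. This is a sketch in plain Python under stated assumptions: points are tuples of floats, the multipliers in `eps_factors` are arbitrary illustrative choices for "a bit higher than the minimum distance", and the MinPts list stops at 8 because the values beyond that are not given in the thread.

```python
import itertools
import math


def min_pairwise_distance(points):
    """Smallest nonzero Euclidean distance between any two points."""
    dists = (math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
             for p, q in itertools.combinations(points, 2))
    return min(d for d in dists if d > 0)


def dbscan_grid(points, eps_factors=(1.1, 1.5, 2.0, 3.0), min_pts_values=(2, 4, 6, 8)):
    """Candidate (eps, MinPts) pairs; eps is the minimum distance times a factor."""
    base = min_pairwise_distance(points)
    return [(base * f, m) for f in eps_factors for m in min_pts_values]
```

Each pair would then be passed to a DBSCAN implementation (the thread uses dbscan from the fpc package) and scored with silhouette, so that the averaged scores can be compared across the grid. Note that `min_pairwise_distance` raises `ValueError` if all points are identical.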

julian-ramos commented 9 years ago

Excellent. Now here is a way to summarize all of the results later.

First of all, take the data set we used for the clustering and plot it as a heatmap. This is the first graph I want to see.

Second, for each set of results from all of the algorithms explored (kmedoids, kmeans, hierarchical clustering and DBScan) create heatmaps as well. Simply put them one on top of the other as subplots, so that we can visualize the differences. Also, sort them using one of the features. In this way we should be able to compare, more or less, all of the different results.

Last, and most important, describe in words the results from each algorithm.

Thanks,

Julian.
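
For the sorted heatmaps, the row order can be computed once and reused for every subplot so that the panels stay comparable. A small sketch in plain Python; grouping rows by cluster label before sorting by the chosen feature is an assumption added here, not something stated above, and `order_for_heatmap` is an illustrative name.

```python
def order_for_heatmap(rows, labels, feature_index=0):
    """Row indices sorted by cluster label, then by one chosen feature."""
    return sorted(range(len(rows)),
                  key=lambda i: (labels[i], rows[i][feature_index]))
```

Passing the same `feature_index` for every algorithm keeps the sorting rule identical across heatmaps; passing a constant label for all rows reduces this to a plain sort by that feature.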

royxue commented 9 years ago

OK, I will reply to you as soon as I finish these things.

BTW, I will pay more attention to writing accurate explanations in English. Sometimes I find it a little difficult to write exactly what I am thinking.

julian-ramos commented 9 years ago

It is ok, you are getting there and also you are learning new terms and the right vocabulary.

Julian.

royxue commented 9 years ago

Thanks :)

royxue / MobileAnalysis

Clustering Result #4