PB-k means something / hc issue

alexdb27 commented 9 years ago

Hello, cc @pierrepo @jbarnoud @HubLot

As suggested by the Ammmmmaaaaaaziing Spider @jbarnoud, I am opening a new issue. I'm a big fan of hierarchical clustering (hc) because it is visually very simple to handle. Nonetheless, as I've said in other issues (#62, #63), hc is perhaps not appropriate when we have thousands of snapshots to compare. With hc, you need N x (N - 1) / 2 comparisons to build the distance matrix (and you need to store it), then, on average, about N x (N / 2) computations to build the dendrogram, so we easily reach O(N^3). We currently manage to get results for simulations of 50 or 100 ns, but when we merge them into 850 ns... no results.

The k-means algorithm is well known and appreciated. It needs a fixed number k of clusters (as hc does, in fact, when we want to analyze the results), and then you only compute N x k distances per pass (plus an average each time), about 20 times. So it is quite fast. The drawback is that it needs initial values for the clusters. In R, this was an issue at the beginning. It no longer is, especially when you have a lot of data. I would be pleased if it could be used in Python with scipy.
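To put rough numbers on the cost argument above, here is a small sketch. The snapshot counts, k = 5 and 8 bytes per stored distance are assumptions chosen for illustration only; they are not the sizes of the real trajectories.

```python
def n_pairwise(n):
    """Number of distinct pairs among n snapshots: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def condensed_matrix_bytes(n, bytes_per_value=8):
    """Memory needed to store one value per pair (upper triangle only)."""
    return n_pairwise(n) * bytes_per_value


def kmeans_distance_count(n, k, n_iter):
    """Distances computed by k-means: each snapshot against each centre, every pass."""
    return n * k * n_iter


if __name__ == "__main__":
    # Hypothetical snapshot counts, for illustration only.
    for n in (1_000, 10_000, 100_000):
        gib = condensed_matrix_bytes(n) / 2**30
        print(f"N={n}: {n_pairwise(n)} pairs, about {gib:.2f} GiB for the matrix; "
              f"k-means (k=5, 20 passes): {kmeans_distance_count(n, 5, 20)} distances")
```

The pair count grows with the square of N, while the k-means count grows linearly with N for fixed k and a fixed number of passes.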

Please share your thoughts.

PS: a small idea. If the size of the data is an issue ;-), perhaps we can (i) set a maximum and/or (ii) above a given threshold, keep only one snapshot every x snapshots.
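The subsampling idea could look like the sketch below. The function name and the `max_snapshots` parameter are hypothetical, and `snapshots` is assumed to be any sliceable sequence such as a list of PB sequences.

```python
import math


def subsample(snapshots, max_snapshots):
    """Keep at most max_snapshots items by taking one snapshot every `stride`."""
    if max_snapshots < 1:
        raise ValueError("max_snapshots must be at least 1")
    # Smallest stride that brings the count down to max_snapshots or fewer.
    stride = max(1, math.ceil(len(snapshots) / max_snapshots))
    return snapshots[::stride]
```

Data that is already below the threshold is returned unchanged, since the stride is then 1.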

pierrepo commented 9 years ago

Hello,

I agree that k-means clustering is very interesting.

Concerning hierarchical clustering, Python might deal with memory differently from R and could be able to deal with a lot of conformations.

I propose that we first implement hierarchical clustering in Python, and then try k-means clustering.

HubLot commented 9 years ago

Another Python library for clustering is scikit-learn. It's quite a heavy dependency, but it provides more clustering methods, k-means among them. It's worth testing. And maybe in the future it could be interesting to let the user choose the clustering algorithm (just an idea).

The critical point of moving to Python (scipy or scikit-learn) is to get the same results as in R (for HC). psi_md_traj_1.pdb could be a good test case.

@alexdb27, what do you mean by not getting any results from an 850 ns file? Is R crashing? Is it due to RAM usage, or to the storage of the distance matrix? It would be interesting to have the file, to see how the Python implementation of the clustering copes.

HubLot commented 9 years ago

The critical point of moving to Python (scipy or scikit-learn) is to get the same results as in R (for HC). psi_md_traj_1.pdb could be a good test case.

At least on psi_md_traj_1.pdb, scikit-learn and R gave the same results (I just did a quick test).

jbarnoud commented 9 years ago

On 04/05/15 17:01, Hub wrote:

The critical point of moving to Python (scipy or scikit-learn) is to get the same results as in R (for HC). psi_md_traj_1.pdb could be a good test case.

At least on psi_md_traj_1.pdb, scikit-learn and R gave the same results (I just did a quick test).


That's cool! Could you paste the code of your test? Knowing you it may even be a notebook...

HubLot commented 9 years ago

Indeed, here it is: http://nbviewer.ipython.org/gist/HubLot/9e0f76bc987489aedabe

The downside, for now, is that the hierarchical clustering in scikit-learn does not directly give the medoids. I am looking for an alternative way.
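For reference, medoids can be computed from the distance matrix and the cluster labels alone, whatever tool produced the labels. This is a minimal sketch assuming the usual definition (the member with the smallest summed distance to the other members of its cluster), which may not match what the R script does; the function names are illustrative.

```python
def medoid_index(distances, members):
    """Return the member whose summed distance to the other members is smallest.

    distances: full square matrix (list of lists) of pairwise distances.
    members: indices of the snapshots that belong to one cluster.
    """
    return min(members, key=lambda i: sum(distances[i][j] for j in members))


def medoids(distances, labels):
    """Map each cluster label to the index of its medoid."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    return {label: medoid_index(distances, members)
            for label, members in clusters.items()}
```

When several members tie, the one with the lowest index is returned.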

jbarnoud commented 9 years ago

Cool !

I added scipy to the notebook and it gives the same clusters as the others for 3 clusters. For 4 clusters, however, scipy and scikit-learn agree with each other but R disagrees a bit.
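Cluster numbering is arbitrary and can differ from one tool to the other, so comparing the raw labels is not enough. Below is a minimal sketch of a check that ignores label names; it is not the code used in the notebook, and the function names are illustrative.

```python
def same_partition(labels_a, labels_b):
    """True if both labelings group the items identically, whatever the label names."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label of one tool must always pair with the same label of the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


def contingency(labels_a, labels_b):
    """Count how many items fall in each (label_a, label_b) pair."""
    table = {}
    for pair in zip(labels_a, labels_b):
        table[pair] = table.get(pair, 0) + 1
    return table
```

When the two partitions disagree, the contingency counts show which clusters were split or merged.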

http://nbviewer.ipython.org/gist/jbarnoud/7e9ea4362e948fe41dea

HubLot commented 9 years ago

Interesting. I computed the medoids in the same way as the R script and updated the gist.

alexdb27 commented 9 years ago

Concerning the 850 ns... it runs... it runs... it runs... it runs... night and day, without any core dump, crash, or output. My guess: a RAM issue for R.

jbarnoud commented 9 years ago

I updated my notebook to include your medoid function. I also increased the number of requested clusters to 5, showing more discrepancy between scipy and scikit-learn on one side, and R on the other side.

It should be noted that I use R version 2.14.1 (2011-12-22), which only has the 'ward' method.

jbarnoud commented 9 years ago

I updated my notebook again to use the newest version of R (3.2.0 (2015-04-16)). The 'ward.D2' method gives a result that differs even more.

Also, I ran into issue #66.

HubLot commented 9 years ago

Ouch... Looking at the source code, the Ward hierarchical clustering in scikit-learn is based on the scipy one, hence the same results. But for R... After digging a little, it seems we may have misused scipy/sklearn, see: http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage/18954990#18954990 https://github.com/scipy/scipy/issues/2614

Concerning the 850 ns... it runs... it runs... it runs... it runs... night and day, without any core dump, crash, or output. My guess: a RAM issue for R.

Strange. Could you send me the file? I could try to see where it fails.

alexdb27 commented 9 years ago

Ask Matthieu G., he had the files... PS: he is not Ali G. (you can find 7 differences)

alig

HubLot commented 9 years ago

Thanks. About the R methods, see #66

HubLot commented 9 years ago

To sum up the results on hierarchical clustering in R vs Python (scipy), I made a notebook.

Basically:

The matrix input expected by the scipy functions differs from the one in R.
Ward with a distance matrix is not possible in scipy.
Average and complete linkage gave the same results in R and scipy.
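The input difference can be illustrated with a short sketch. It assumes a precomputed, symmetric distance matrix with a zero diagonal; the values are invented for the example and this is not the notebook code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix for 4 snapshots (invented values).
square = np.array([
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 5.0, 6.0],
    [6.0, 5.0, 0.0, 2.0],
    [7.0, 6.0, 2.0, 0.0],
])

# linkage() reads a 2-D array as observation vectors, not as distances,
# so the square matrix is first converted to the condensed 1-D form.
condensed = squareform(square)

# Complete linkage on the precomputed distances.
tree = linkage(condensed, method="complete")

# Cut the dendrogram into 2 clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Passing `square` directly to `linkage` would not raise an error here, but the rows would be treated as coordinates, which gives a different tree. The scipy documentation also notes that the 'ward' method is only correctly defined for Euclidean distances.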

jbarnoud commented 9 years ago

Great test! I am quite disappointed with scipy, but there are other options for using Ward in Python outside of scipy. The question now is: what criterion should we use to compare the clustering methods and figure out which one is the most appropriate?

HubLot commented 9 years ago

I updated the notebook with scikit-learn, since its input is different. This doesn't change the conclusion. I agree with the questions raised by @jbarnoud.

alexdb27 commented 9 years ago

Excellent work @HubLot, I really like your notebook. What is nice with Ward is that the clusters are well balanced. What is nice with average is that the clusters are really based on a "natural rule", i.e., what is close is close in terms of simple distance. Complete is not too far from this one. I'm not a big fan of single linkage, as it is like an onion, and onions make me cry like a river...

So for me it is complete > average > Ward > single linkage
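For readers less familiar with these criteria, the sketch below shows how the distance between two clusters is defined under single, complete and average linkage. Ward is left out because it is defined through the increase in within-cluster variance rather than through pairwise distances. The function name is illustrative.

```python
def linkage_distance(distances, cluster_a, cluster_b, method):
    """Distance between two clusters under a given linkage criterion.

    distances: full square matrix (list of lists) of pairwise distances.
    cluster_a, cluster_b: lists of member indices.
    """
    pairs = [distances[i][j] for i in cluster_a for j in cluster_b]
    if method == "single":      # closest pair of members
        return min(pairs)
    if method == "complete":    # farthest pair of members
        return max(pairs)
    if method == "average":     # mean over all pairs of members
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown method: {method}")
```

Single linkage only needs one close pair to merge two clusters, which is why it tends to chain clusters together layer by layer.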

pierrepo commented 9 years ago

Very nice notebooks, @jbarnoud and @HubLot. Since the complete method gives the same results for HC in R, Python/scipy and Python/scikit-learn, I propose we implement the Python/scipy method in PBxplore (it is Python and it has fewer dependencies than Python/scikit-learn).

However, I am not sure hierarchical clustering is the best clustering method here. As mentioned by @alexdb27, it is

visually very simple to handle

but I am not sure the visual we get here is meaningful. Indeed, the distance we use is quite coarse, and I do not know how to interpret the fact that two clusters are close to each other and far from a third one.

So what do you believe is the most useful to implement in PBxplore?

HC/complete with Python/scipy
K-means with Python/scipy
both ?

In any case, I advocate removing R from the clustering process. PBxplore will then be easier to install and to maintain.

alexdb27 commented 9 years ago

Both !!!!

jbarnoud commented 9 years ago

After discussing it with @alexdb27 and @HubLot, I started to implement k-means.

pierrepo commented 9 years ago

Hi @jbarnoud. Is everything OK with the k-means implementation? I'd like to get rid of R as soon as possible.

alexdb27 commented 9 years ago

https://pl.wikipedia.org/wiki/Kmin_rzymski a new implementation ?

jbarnoud commented 9 years ago

Hey ! Here is a prototype of K-means for PBxplore: https://gist.github.com/jbarnoud/fc27c5048d6e8f394598

This notebook implements the K-means algorithm and tries to visualize the clusters that are produced. I am looking for a way to validate the clustering. Any idea?
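For readers who cannot open the gist, here is a generic sketch of the k-means loop (Lloyd's algorithm) on plain numeric vectors. It is not the prototype linked above: how PB sequences are encoded and which distance the prototype uses are not covered here, and squared Euclidean distance is an assumption made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise average of a non-empty list of points."""
    return [sum(values) / len(points) for values in zip(*points)]


def kmeans(points, k, max_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centres)."""
    rng = random.Random(seed)
    # Initialise the centres with k points drawn at random without replacement.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Update step: each centre becomes the average of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = mean(members)
    return labels, centres
```

The result depends on the random initialisation, so in practice it is common to run the algorithm several times with different seeds and compare the outcomes.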

@pierrepo @alexdb27 @HubLot

jbarnoud commented 9 years ago

I updated the K-means notebook 2 days ago but I don't know if you got notified.

@pierrepo @alexdb27 @HubLot

pierrepo commented 8 years ago

@HubLot and @jbarnoud, you did a great job on this issue. PR #106 implements the k-means method. In order not to lose the work you previously did on hc, could you please add to PBxplore a simple notebook explaining how to do hierarchical clustering with the PBxplore API and either scipy or scikit-learn? It could be a simple reformatting of this notebook: http://nbviewer.jupyter.org/gist/jbarnoud/7e9ea4362e948fe41dea

pierrepo / PBxplore

PB-k means something / hc issue #64