alexdb27 opened this issue 9 years ago
Hello,
I agree that k-means clustering is very interesting.
Concerning hierarchical clustering, Python might deal with memory differently from R and could be able to handle a lot of conformations.
I propose that we implement the clustering with the hierarchical clustering in Python first and then try the k-means clustering.
For the clustering in Python, another library to consider is scikit-learn. It's quite a dependency, but it provides more clustering methods (k-means among others). It's worth testing. And maybe in the future, it could be interesting to let the user choose the clustering algorithm (just an idea).
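For what it's worth, here is a minimal sketch (not PBxplore code) of what hierarchical clustering with scikit-learn could look like, assuming we already have a pairwise distance matrix between conformations; depending on the scikit-learn version, the parameter may be called `metric` instead of `affinity`:

```python
# Minimal sketch: hierarchical clustering with scikit-learn from a
# precomputed distance matrix (dummy data, not PBxplore code).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical pairwise distance matrix between conformations.
distances = np.random.rand(100, 100)
distances = (distances + distances.T) / 2  # make it symmetric
np.fill_diagonal(distances, 0.0)

# 'ward' linkage needs raw coordinates, so with a precomputed matrix
# we are limited to e.g. 'complete' or 'average' linkage.
model = AgglomerativeClustering(n_clusters=3,
                                affinity='precomputed',
                                linkage='complete')
labels = model.fit_predict(distances)
print(labels)  # cluster index for each conformation
```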
The critical point of moving to Python (scipy or scikit-learn) is to get the same hierarchical clustering results as in R. psi_md_traj_1.pdb could be a good example.
@alexdb27, what do you mean you don't have any results from an 850 ns file? Is R crashing? Is it due to RAM usage, or to the storage of the distance matrix? It would be interesting to have the file to see how the Python implementation of the clustering behaves.
At least, on psi_md_traj_1.pdb, scikit-learn & R gave the same results (just did a quick test).
That's cool! Could you paste the code of your test? Knowing you it may even be a notebook...
Indeed, here it is: http://nbviewer.ipython.org/gist/HubLot/9e0f76bc987489aedabe
The downside, for now, is that it's not possible to get the medoid directly from scikit-learn's hierarchical clustering. I am looking for an alternative way.
Cool!
I added scipy to the notebook and it gives the same clusters as the others for 3 clusters. For 4 clusters, however, scipy and scikit-learn agree with each other but R disagrees a bit.
http://nbviewer.ipython.org/gist/jbarnoud/7e9ea4362e948fe41dea
Interesting. I computed the medoids in the same way as the R script. I updated the gist.
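I don't know exactly how the R script defines the medoid, but a common definition is the cluster member with the smallest summed distance to the other members of its cluster; here is a small sketch assuming a square distance matrix and an array of cluster labels:

```python
import numpy as np

def medoid_indices(distances, labels):
    """For each cluster, return the index of the member whose summed
    distance to the other members of the cluster is minimal."""
    medoids = {}
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        # Restrict the distance matrix to the cluster members.
        sub = distances[np.ix_(members, members)]
        medoids[cluster] = members[sub.sum(axis=1).argmin()]
    return medoids

# Usage, with 'distances' and 'labels' from a clustering step:
# medoids = medoid_indices(distances, labels)
```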
Concerning the 850 ns trajectory... it runs... and runs... and runs... night and day, without any core dump, crash, or output. My guess: a RAM issue for R.
I updated my notebook to include your medoid function. I also increased the number of requested clusters to 5, showing more discrepancy between scipy and scikit-learn on one side, and R on the other side.
It should be noted that I use R version 2.14.1 (2011-12-22), which only has the 'ward' method.
I updated my notebook again to use the newest version of R (3.2.0, 2015-04-16). The 'ward.D2' method gives a result that differs even more.
Also, I encounter issue #66.
Ouch... Looking at the source code, the Ward hierarchical clustering in scikit-learn is based on the scipy one, hence the same results. But for R... After digging a little bit, maybe we misused scipy/sklearn, see: http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage/18954990#18954990 https://github.com/scipy/scipy/issues/2614
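If I read those links correctly, the pitfall is that scipy.cluster.hierarchy.linkage interprets a square 2-D array as a set of observation vectors, not as a distance matrix; the distance matrix has to be passed in condensed form. Here is a sketch of what I believe the correct call looks like (with a dummy matrix standing in for the real distances):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# 'distances' is a symmetric N x N distance matrix with zeros on the
# diagonal (filled with dummy values for the sake of the example).
distances = np.random.rand(100, 100)
distances = (distances + distances.T) / 2
np.fill_diagonal(distances, 0.0)

# WRONG (silently treated as 100 observations of 100 features):
# Z = linkage(distances, method='ward')

# RIGHT: convert to the condensed form expected by linkage().
condensed = squareform(distances, checks=False)
Z = linkage(condensed, method='ward')

# Cut the tree into a fixed number of clusters.
labels = fcluster(Z, t=4, criterion='maxclust')
```

Note also that the scipy documentation says the 'ward' (and 'centroid', 'median') methods are only correctly defined for Euclidean distances, which might matter here as well.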
Strange. Could you send me the 850 ns file? I could try to see where it goes wrong.
Ask Matthieu G., he has the files... PS: it is not Ali G. (you can find 7 differences).
Thanks. About the R methods, see #66
To sum up the results about hierarchical clustering in R vs Python (scipy), I made a notebook.
Basically:
Great test! I am quite disappointed about scipy, but there are other options to use Ward with Python outside of scipy. The question now would be: what criterion should we use to compare the clustering methods and figure out which one is the most appropriate?
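Just an idea from my side (not something already in the notebooks): one possible quantitative criterion is the silhouette score, which can be computed from the same precomputed distance matrix used for the clustering. A small sketch with dummy data:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Dummy symmetric distance matrix and cluster labels, standing in for
# the real distances between conformations and a clustering result.
rng = np.random.RandomState(0)
distances = rng.rand(60, 60)
distances = (distances + distances.T) / 2
np.fill_diagonal(distances, 0.0)
labels = np.repeat([0, 1, 2], 20)

# Higher silhouette means tighter, better separated clusters; the same
# distance matrix can be reused to score ward, complete, average, etc.
print(silhouette_score(distances, labels, metric='precomputed'))
```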
I updated the notebook to include scikit-learn, since its input is different. This doesn't change the conclusion. I agree with the questions raised by @jbarnoud.
Excellent work @HubLot, I really like your notebook. What is nice with Ward is the fact that the clusters are well balanced. What is nice with average is the fact that clusters are really based on a "natural rule", i.e., what is close is close in terms of simple distance. Complete is not too far away from this one. I'm not a big fan of single linkage as it is like an onion, and onions make me cry like a river...
So for me it is complete > average > Ward > single linkage.
Very nice notebooks @jbarnoud and @HubLot! Since the complete method gives the same results for HC in R, Python/scipy, and Python/scikit-learn, I propose we implement the Python/scipy method in PBxplore (this is Python and it has fewer dependencies than Python/scikit-learn).
However, I am not sure hierarchical clustering is the best clustering method here. As mentioned by @alexdb27, it is "visually very simple to handle", but I am not sure the visual we can get here is meaningful. Indeed, the distance we use is quite coarse, and I do not know how to interpret the fact that two clusters are close to each other and far from a third one.
So what do you believe is the most useful to implement in PBxplore?
In any case, I advocate removing R from the clustering process. This will make PBxplore easier to install and to maintain.
Both!!!!
After discussing it with @alexdb27 and @HubLot, I started to implement the k-means.
Hi @jbarnoud. Is the k-means implementation OK? I'd like to get rid of R as soon as possible.
https://pl.wikipedia.org/wiki/Kmin_rzymski a new implementation?
Hey! Here is a prototype of K-means for PBxplore: https://gist.github.com/jbarnoud/fc27c5048d6e8f394598
This notebook implements the K-means algorithm and tries to visualize the clusters that are produced. I am looking for a way to validate the clustering. Any idea?
@pierrepo @alexdb27 @HubLot
I updated the K-means notebook 2 days ago but I don't know if you got notified.
@HubLot and @jbarnoud, you did a great job on this issue. PR #106 implements the k-means method. In order not to lose the work you previously did on hierarchical clustering, could you please add to PBxplore a simple notebook explaining how to do hierarchical clustering with the PBxplore API and either scipy or scikit-learn? It could be a simple reformatting of this notebook: http://nbviewer.jupyter.org/gist/jbarnoud/7e9ea4362e948fe41dea
Hello, cc @pierrepo @jbarnoud @HubLot
As suggested by the Ammmmmaaaaaaziing Spider @jbarnoud, I open a new issue. I'm a big fan of hierarchical clustering (hc) as it is visually very simple to handle. Nonetheless, I've expressed in other issues such as #62 and #63 the fact that hc is perhaps not appropriate when we have thousands of snapshots to compare. With hc, you need N×(N-1)/2 comparisons to create the distance matrix (and you need to store it), then on average about N×(N/2) computations to build the dendrogram, so we easily go to O(N^3). We currently succeed in getting results for simulations of 50 or 100 ns, but when we merge them into 850 ns... no results. The k-means algorithm is well known and appreciated; it needs a fixed number k of clusters (as does hc, in fact, when we want to analyse the result), and then you compute only N distances about 20 times (plus an average each time), so it is quite fast. OK, the drawback is that it needs initial values for the clusters. In R, at the beginning, this was an issue. It no longer is, especially when you have a lot of data. I would be pleased if it could be used in Python with scipy.
Please share your thoughts.
PS: a small idea. If the size of the data is an issue ;-), perhaps we can (i) set a maximum and/or (ii) if there are more snapshots than the threshold, only keep one snapshot every x frames.
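Just to illustrate the idea on the Python side (this is only a sketch, not a proposed PBxplore implementation): scipy does ship a k-means routine in scipy.cluster.vq. K-means needs numeric vectors, so the featurisation below (per-frame counts of each PB letter) and the example sequences are purely hypothetical:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

PB_LETTERS = 'abcdefghijklmnop'  # the 16 protein blocks

def pb_counts(sequence):
    """Very naive featurisation: count of each PB letter in a frame.
    This encoding is only a placeholder, not what PBxplore does."""
    return [sequence.count(letter) for letter in PB_LETTERS]

# Hypothetical PB sequences, one per frame of the trajectory.
frames = [
    'mmmmmnopacddddd',
    'mmmmmnopacddddc',
    'klmmmnopghiaddd',
    'klmmlnopghiaddd',
]
features = np.array([pb_counts(seq) for seq in frames], dtype=float)

# 2 clusters, 20 iterations, initial centroids picked among the frames.
centroids, labels = kmeans2(features, 2, iter=20, minit='points')
print(labels)  # cluster index for each frame
```

The nice part, as described above, is that each iteration only needs N distances to the k centroids, so the full N×N distance matrix never has to be built or stored.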