pierrepo / PBxplore

A suite of tools to explore protein structures with Protein Blocks :snake:
https://pbxplore.readthedocs.org/en/latest/
MIT License

PBclust is not compatible with most recent versions of R #66

Closed: jbarnoud closed this issue 8 years ago

jbarnoud commented 9 years ago

New versions of R require the hclust method to be set to 'ward.D' instead of just 'ward'. We should support these newer versions of R, while keeping support for older ones.

alexdb27 commented 9 years ago

New one !!!

alexdb27 commented 9 years ago

By default, I would prefer we set it to 'complete' [not a big fan of Ward linkage]

HubLot commented 9 years ago

The issue appears with version 3.0.3 (link):

Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(, method="ward") corresponds to hclust(, "ward.D2").

HubLot commented 9 years ago

Regarding issue #31 and the comments of @alexdb27, the ward.D2 method should be used in R versions > 3.0.3 (also this paper as a reference). For versions <= 3.0.3, the distance matrix has to be squared before calling the function, with the method keyword set to ward. This gives the same results: see this example.

The only remaining issue is the calculation of the medoids, which will be based on the squared matrix (for versions <= 3.0.3) or on the normal matrix (for versions > 3.0.3); hence the difference in medoids in the example.
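The medoid dependence can be seen concretely: a cluster's medoid is the member with the smallest summed dissimilarity to the other members, so it can change when the matrix is squared. A minimal pure-Python sketch (the `medoid` helper and the matrix values are hypothetical, not PBxplore code):

```python
def medoid(dist, members):
    """Return the member whose summed distance to the other members is
    minimal (hypothetical helper, not the PBxplore code)."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))

# Toy symmetric dissimilarity matrix for 5 items (made-up values chosen
# so that squaring the matrix changes the medoid).
d = [[0.0, 4.0, 3.0, 3.0, 3.0],
     [4.0, 0.0, 1.0, 1.0, 5.1],
     [3.0, 1.0, 0.0, 6.0, 6.0],
     [3.0, 1.0, 6.0, 0.0, 6.0],
     [3.0, 5.1, 6.0, 6.0, 0.0]]
d2 = [[x * x for x in row] for row in d]  # squared, as fed to R <= 3.0.3

members = [0, 1, 2, 3, 4]
print(medoid(d, members))   # 1 on the raw matrix
print(medoid(d2, members))  # 0 on the squared matrix: a different medoid
```

Item 1 has mostly small distances plus one large one; squaring penalizes the large distance enough to hand the medoid to item 0.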

The best solution is to update the Python wrapper around the R function hclust in PBlib.py so that the input parameter is always the "normal" matrix, with a check on the R version to determine whether the matrix has to be squared. That way, the code that computes the medoids does not change.

jbarnoud commented 9 years ago

I implemented what @HubLot described in his previous comment. The branch is available here: https://github.com/jbarnoud/PBxplore/tree/compatR.

By doing so I get the expected behavior based on @HubLot's notebook. However, this means the tests break, as the result with the squared distance matrix differs from the one with the raw matrix. I need a go from @pierrepo, and perhaps @alexdb27, before I declare the previous behavior wrong and replace the reference file in the regression tests.

pierrepo commented 9 years ago

This is nice work. However, I am wondering whether it would be easier to remove R from the clustering process entirely... We could then concentrate our efforts on a single implementation (Python), but with two available methods (HC and k-means), as stated in issue #64. What do you think?

pierrepo commented 9 years ago

Of course, we can merge this branch, and then implement the Python clustering.

jbarnoud commented 9 years ago

I need to change the tests before the branch can be merged. I'll send a pull request when it is done (also, I forgot the case of a user asking for 'ward.D').

About concentrating the efforts on the Python implementation: I tackled the ward issue because it causes compatibility issues between versions of R. But the big picture of PB clustering bothers me a little. We pick clustering methods more or less at random, without any proper test of how they perform.

I am writing a benchmark for the different clustering algorithms so we can have an idea of which algorithm we should push. I think it is a necessary step before spending time implementing any of them.
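A benchmark like the one mentioned above could start from a skeleton along these lines, timing each algorithm on the same data. Everything here is a hypothetical sketch: the data are random 2-D points, not PB sequences, and the `bench` helper is made up:

```python
import time

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2

# Toy data standing in for per-frame feature vectors (hypothetical:
# real PBxplore input would be dissimilarities between PB sequences).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 3.0)])

def bench(name, fn):
    """Run one clustering function, report wall time and cluster count."""
    start = time.perf_counter()
    labels = fn()
    print(f"{name}: {time.perf_counter() - start:.4f}s, "
          f"{len(set(labels))} clusters")
    return labels

hc = bench("hclust", lambda: fcluster(linkage(data, method='ward'),
                                      t=2, criterion='maxclust'))
km = bench("k-means", lambda: kmeans2(data, 2, minit='++', seed=0)[1])
```

A real benchmark would also track memory, since the hierarchical step needs the full pairwise distance matrix while k-means does not.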

pierrepo commented 9 years ago

I agree. Let's do this step by step: get rid of the R issue first, then implement clustering in Python. We can do a quick benchmark, but mainstream methods such as HC and k-means should be fine. Another option could be to prepare the data for clustering and then let users apply the method of their choice. We can of course provide examples/demos with a few clustering methods.

@alexdb27 and @HubLot we need your opinion on this.

alexdb27 commented 9 years ago

@pierrepo @jbarnoud I think your view is the right one. I think k-means is probably the simplest to use. Hclust will need a large amount of memory for trajectories of hundreds of ns, won't it? @HubLot

HubLot commented 9 years ago

I agree with doing this step by step and having a benchmark. The distance matrix for Hclust will require a large amount of memory for long trajectories; in fact, the way the distance matrix calculation is done now, it will crash. With the k-means algorithm, the open question is how to define an "average" sequence to represent the center of a group. We discussed this with @alexdb27 a few weeks ago; I can create a new issue to develop that point.

So, as @pierrepo said, let's stick with HC for now and try to implement it with scipy. Then we can think about the next/better algorithm.
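Moving the HC step to scipy could look like the following sketch, which clusters directly from a dissimilarity matrix so no R is needed. The matrix values are hypothetical stand-ins for PB sequence dissimilarities, and this is not the eventual PBxplore implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric dissimilarity matrix for 4 frames (hypothetical values).
dist = np.array([[0.0, 0.2, 0.9, 1.0],
                 [0.2, 0.0, 0.8, 0.9],
                 [0.9, 0.8, 0.0, 0.1],
                 [1.0, 0.9, 0.1, 0.0]])

# linkage() wants the condensed (upper-triangle) form. scipy's
# method='ward' corresponds to R's hclust(method='ward.D2') criterion,
# so no manual squaring of the matrix is needed here.
Z = linkage(squareform(dist), method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # frames 0,1 land in one cluster and 2,3 in the other
```

This would also make it straightforward to reuse the same distance matrix for the medoid computation afterwards.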

jbarnoud commented 8 years ago

This is made obsolete by #64. I close the issue.