RMSD clustering implementation

mgiulini commented 2 years ago

Clustering structures with the RMSD matrix can be beneficial in certain situations, for example when a small molecule is considered as a ligand.

There could be two modules for it:

a RMSD matrix calculation module
a clustering module, with some flexibility on the clustering algorithm and on the thresholds

Discussed with @joaomcteixeira and @amjjbonvin

joaomcteixeira commented 2 years ago

Yes, two modules are better than one. We can keep the RMSD matrix and use in different modules as input.

What I suggest is for you to prepare a python script that does everything you need to calculate the RMSD matrix. That python script should have a function that receives a list of paths to PDB files. This function will be the entry point for haddock3.

I suggest doing the implementation in pure python (numpy, scipy, etc). Don't worry at first with the performance. Once this is working, we will work with the performance using Numba.

Finally, we will organize your code into the haddock3 code, but most likely it will all remain inside the RMSD module.

Have a look around if you need any functions from HADDOCK3 already.

Once everything is working in your computer, we will work to add it as a module :wink: Welcome!

amjjbonvin commented 2 years ago

Possibly with different options on how to perform the fitting and RMSD calculations (i.e. molecule/residue selection)

What I suggest is for you to prepare a python script that does everything you need to calculate the RMSD matrix. That python script should have a function that receives a list of paths to PDB files. This function will be the entry point for haddock3

rvhonorato commented 2 years ago

The rmsd calculation is implemented in caprieval with kabsch, you should then pass all the atoms instead of the only the interface - it handles different numbering and all that and small molecules

This is the one Joao suggested me to make it from scratch. Unless he also suggested you to re-do it from scratch because of the project guidelines, I'd suggest you try to re-use the functions the best as possible!

rvhonorato commented 2 years ago

Possibly with different options on how to perform the fitting and RMSD calculations (i.e. molecule/residue selection)

This is how the capri class handles it, the fitting is based on either the interface, ligand, etc - could also change it to pass a numbering selection

amjjbonvin commented 2 years ago

Possibly with different options on how to perform the fitting and RMSD calculations (i.e. molecule/residue selection)

This is how the capri class handles it, the fitting is based on either the interface, ligand, etc - could also change it to pass a numbering selection

Will be needed to ensure the FIT / RMSD are calculated on the same set of residues since we need to compare all models vs all models.

rvhonorato commented 2 years ago

yep, the way the rmsd is calculated in the capri module is by using the interface/ligand/interface-ligand atoms from the receptor/lowest scoring structure, would just be a matter of passing a list of numbers instead.

rvhonorato commented 2 years ago

would be great actually to see how it would work with https://github.com/biopython/biopython/pull/3849 - since biopython is one of the allowed dependencies

mgiulini commented 2 years ago

Yes, that seems interesting, especially in terms of performance with respect to the fully pythonic code. If you agree, I will try to implement the clustering procedure using these biopython routines to calculate the pairwise rmsd matrix.

rvhonorato commented 2 years ago

I think its a great idea, if it useful just duplicate part of the capri.py for now, if you want to go the extra mile would be nice to have a different superimposition algorithm other than kabsch 🤓

JoaoRodrigues commented 2 years ago

I saw the link on the biopython PR, thanks for the mention @rvhonorato. I think there is a slight confusion regarding superposition algorithms though.

CEAlign does not replace Kabsch, in fact it uses Kabsch to do the actual structure alignment (by default). The advantage of CEAlign (or any structural alignment code) is that it derives a mapping of which residues correpond to which between two structures, when sequence identity is low. For clustering purposes, you usually deal with multiple conformations of the same molecule, so, sequence identity is very high. Structural alignment codes are not suitable in these situations because they can introduce slight register shifts.

If you want a fast(er) superposition algorithm, have a look at this https://github.com/biopython/biopython/pull/3876. QCP is roughly 30% faster than Kabsch. You'd want to derive a mapping of residues a priori by simple pairwise matching, since the sequences are all the same, or, in case of deletions, insertions, use sequence alignment. Then, you'd extract the coordinates for the relevant atoms and use your favorite Kabsch or QCP. See this for an example: https://gist.github.com/JoaoRodrigues/e3a4f2139d10888c679eb1657a4d7080

TL;DR. Don't use CEAlign for clustering conformations of the same molecule. In case of deletions, use sequence alignment. If performance matters, you can do a preprocessing step where you extract all sequences from all the molecules and do a single multiple sequence alignment to derive mappings in one go (saves you the N pairwise alignments).

rvhonorato commented 2 years ago

Thanks for the clarification @JoaoRodrigues !

mgiulini commented 2 years ago

@amjjbonvin I found the cluster_struct .cpp (https://github.com/haddocking/haddock/blob/b311dd900a5c62d66b727be196a40e9b6fa7d994/tools/cluster_struc.cpp) script used in haddock2.4. Do we want to adapt it in order to add the new desired functionalities? Or it's better to rebuild it from scratch in python? At a first glance I would go for the first option

amjjbonvin commented 2 years ago

We can integrate it in haddock3 - it is in principle our own code

joaomcteixeira commented 2 years ago

If you make it python with numba I will :heart: you. But if that's too much work, we can go with the cpp. I can help you on anything you need.

mgiulini commented 2 years ago

I need to think a bit about it..the optimal implementation from scratch of the various methods is not difficult, except for the labeling routines, which are tricky.

amjjbonvin commented 2 years ago

Note that cluster_struc only does the clustering and not the calculation of the RMSD matrix.

mgiulini commented 2 years ago

Yes indeed!

haddocking / haddock3

RMSD clustering implementation #380