Closed mgiulini closed 2 years ago
Yes, two modules are better than one. We can keep the RMSD matrix and use in different modules as input.
What I suggest is for you to prepare a python script that does everything you need to calculate the RMSD matrix. That python script should have a function that receives a list of paths to PDB files. This function will be the entry point for haddock3.
I suggest doing the implementation in pure python (numpy, scipy, etc). Don't worry at first with the performance. Once this is working, we will work with the performance using Numba.
Finally, we will organize your code into the haddock3 code, but most likely it will all remain inside the RMSD module.
Have a look around if you need any functions from HADDOCK3 already.
Once everything is working in your computer, we will work to add it as a module :wink: Welcome!
Possibly with different options on how to perform the fitting and RMSD calculations (i.e. molecule/residue selection)
What I suggest is for you to prepare a python script that does everything you need to calculate the RMSD matrix. That python script should have a function that receives a list of paths to PDB files. This function will be the entry point for haddock3
The rmsd calculation is implemented in caprieval
with kabsch, you should then pass all the atoms instead of the only the interface - it handles different numbering and all that and small molecules
This is the one Joao suggested me to make it from scratch. Unless he also suggested you to re-do it from scratch because of the project guidelines, I'd suggest you try to re-use the functions the best as possible!
Possibly with different options on how to perform the fitting and RMSD calculations (i.e. molecule/residue selection)
This is how the capri class handles it, the fitting is based on either the interface, ligand, etc - could also change it to pass a numbering selection
Possibly with different options on how to perform the fitting and RMSD calculations (i.e. molecule/residue selection)
This is how the capri class handles it, the fitting is based on either the interface, ligand, etc - could also change it to pass a numbering selection
Will be needed to ensure the FIT / RMSD are calculated on the same set of residues since we need to compare all models vs all models.
yep, the way the rmsd is calculated in the capri module is by using the interface/ligand/interface-ligand atoms from the receptor/lowest scoring structure, would just be a matter of passing a list of numbers instead.
would be great actually to see how it would work with https://github.com/biopython/biopython/pull/3849 - since biopython is one of the allowed dependencies
Yes, that seems interesting, especially in terms of performance with respect to the fully pythonic code. If you agree, I will try to implement the clustering procedure using these biopython routines to calculate the pairwise rmsd matrix.
I think its a great idea, if it useful just duplicate part of the capri.py for now, if you want to go the extra mile would be nice to have a different superimposition algorithm other than kabsch 🤓
I saw the link on the biopython PR, thanks for the mention @rvhonorato. I think there is a slight confusion regarding superposition algorithms though.
CEAlign does not replace Kabsch, in fact it uses Kabsch to do the actual structure alignment (by default). The advantage of CEAlign (or any structural alignment code) is that it derives a mapping of which residues correpond to which between two structures, when sequence identity is low. For clustering purposes, you usually deal with multiple conformations of the same molecule, so, sequence identity is very high. Structural alignment codes are not suitable in these situations because they can introduce slight register shifts.
If you want a fast(er) superposition algorithm, have a look at this https://github.com/biopython/biopython/pull/3876. QCP is roughly 30% faster than Kabsch. You'd want to derive a mapping of residues a priori by simple pairwise matching, since the sequences are all the same, or, in case of deletions, insertions, use sequence alignment. Then, you'd extract the coordinates for the relevant atoms and use your favorite Kabsch or QCP. See this for an example: https://gist.github.com/JoaoRodrigues/e3a4f2139d10888c679eb1657a4d7080
TL;DR. Don't use CEAlign for clustering conformations of the same molecule. In case of deletions, use sequence alignment. If performance matters, you can do a preprocessing step where you extract all sequences from all the molecules and do a single multiple sequence alignment to derive mappings in one go (saves you the N pairwise alignments).
Thanks for the clarification @JoaoRodrigues !
@amjjbonvin I found the cluster_struct .cpp
(https://github.com/haddocking/haddock/blob/b311dd900a5c62d66b727be196a40e9b6fa7d994/tools/cluster_struc.cpp) script used in haddock2.4. Do we want to adapt it in order to add the new desired functionalities? Or it's better to rebuild it from scratch in python? At a first glance I would go for the first option
We can integrate it in haddock3 - it is in principle our own code
If you make it python with numba I will :heart: you. But if that's too much work, we can go with the cpp
. I can help you on anything you need.
I need to think a bit about it..the optimal implementation from scratch of the various methods is not difficult, except for the labeling routines, which are tricky.
Note that cluster_struc only does the clustering and not the calculation of the RMSD matrix.
Yes indeed!
Clustering structures with the RMSD matrix can be beneficial in certain situations, for example when a small molecule is considered as a ligand.
There could be two modules for it:
Discussed with @joaomcteixeira and @amjjbonvin