project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
206 stars 42 forks source link

[Feature Request] RMSD for comparing small molecules #43

Open ConorFWild opened 4 years ago

ConorFWild commented 4 years ago

Having a function to calculate the RMSD between small molecules whose atoms are not necessarily listed in the same order seems like it would be potentially very useful, although it would need some graph matching and symmetry shenanigans!

wojdyr commented 3 years ago

I added calculation of RMSD and the superposition for residue spans: https://gemmi.readthedocs.io/en/latest/analysis.html#superposition It does the sequence alignment to find matching residues, but then it is assumed that atoms with the same names are corresponding to each other (alternatively, only Cα atoms are used for the superposition).

When you wrote about small molecules - did you mean small molecules represented by SmallStructure or Residue?

ConorFWild commented 3 years ago

I was thinking SmallStructure ^^

wojdyr commented 3 years ago

What pair of molecules could be used as an example?

wojdyr commented 3 years ago

I was about to add a function that superposes two Residues, but after thinking about – indeed, it'd be good to account for swapping of equivalent atoms, such as CG1 and CG2 in valine. Which would require iterating over graph automorphisms. So I started searching for a C++ library that could do it and that could be integrated with gemmi. Apparently, there are two separate classes of algorithms we could use:

I'm writing it all down to not forget it. Perhaps I'll get back to it at some point. One question is how efficient a subgraph isomorphism algorithm is for finding automorphisms. Another – how hard it'd be to integrate it with gemmi and if it's worth the effort.

CV-GPhL commented 3 years ago

Hi Marcin,

On Sat, Jan 23, 2021 at 12:15:17PM -0800, Marcin Wojdyr wrote:

I was about to add a function that superposes two Residues,

I often use two different types of comparisons: one with superposition and one without. The first one is important if you want to compare e.g. two residues in context (within a given unit cell or a given binding site) while the latter will compare relative differences (after computing RT operators).

but after thinking about – indeed, it'd be good to account for swapping of equivalent atoms, such as CG1 and CG2 in valine.

I would highly recommend not to allow swapping of these atoms - at least not by default: the resulting structure is different, i.e. the atom-name based conformational restraints will get broken.

What you can do is allow for a 180-degree rotation for (a) the truly symmetrical side-chains (PHE, TYR), (b) the symmetrical side-chains if explicitly placed hydrogens are being ignored (ASP, GLU) and (c) side-chains that are symmetrical in terms of atom location (ASN, GNL, HIS). As an extension you could also implement 180-degree rotations for VAL, LEU etc.

I would definitely go for a rotation and not a swaping: we had the latter initially in BUSTER with sometimes odd side-effects. Defining a rotation is much more reliable - see

grep SWAP $BDG_home/tnt/data/protgeo_eh99.dat

Which would require iterating over graph automorphisms.

Indeed: if the two copies come e.g. from differnet SMILES strings you can't rely on atom naming at all.

So I started searching for a C++ library that could do it and that could be integrated with gemmi.

Have you had a look at BALL yet? Maybe it doesn't do exactly what you are looking for, but I really like it and have used it in multiple projects so far. Yes, documentation is a bit scattered, but it has some examples e.g. here:

https://github.com/BALL-Project/ball/wiki/RMSD

I'd love if that could be connected to Gemmi :-)

Cheers

Clemens

Apparently, there are two separate classes of algorithms we could use:

  • for graph isomorphism: codes such as nauty and traces, saucy, bliss, conauto. The first two are available (together) under the Apache license and these are the only ones in the list that would be license-compatible with gemmi. But it'd be rather a significant effort to integrate them with gemmi. Perhaps we could work with pynauty? I haven't investigated it. I suppose there are also codes I haven't come across. I see GraphSymmetryFinder class in a bigger package, but I can't even tell what method it is using.

  • for subgraph isomorphism: codes such as vf2/vf3 and RI. The latter is under the MIT license and it's a small, header-only C++ library. It could be a good fit. But using subgraph isomorphism for automorphism is probably far from optimal. I haven't found any benchmarks, though. The RI paper contains benchmarks of small molecule matching, but for subgraph matching.

I'm writing it all down to not forget it. Perhaps I'll get back to it at some point. One question is how efficient a subgraph isomorphism algorithm is for finding automorphisms. Another – how hard it'd be to integrate it with gemmi and if it's worth the effort.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/project-gemmi/gemmi/issues/43#issuecomment-766172814

--

*--------------------------------------------------------------

wojdyr commented 3 years ago

but after thinking about – indeed, it'd be good to account for swapping of equivalent atoms, such as CG1 and CG2 in valine.

I would highly recommend not to allow swapping of these atoms - at least not by default: the resulting structure is different, i.e. the atom-name based conformational restraints will get broken. What you can do is allow for a 180-degree rotation for (a) the truly symmetrical side-chains (PHE, TYR),

I don't get it. How is the rotation different than swapping atoms in such case? Are there any restraints that differentiate between CG1 and CG2 in valine?

Have you had a look at BALL yet? Maybe it doesn't do exactly what you are looking for, but I really like it and have used it in multiple projects so far. Yes, documentation is a bit scattered, but it has some examples e.g. here: https://github.com/BALL-Project/ball/wiki/RMSD

I don't think I can connect the two libraries, but I could borrow some ideas, such as having a class similar to AtomBijection in BALL.

CV-GPhL commented 3 years ago

Hi Marcin,

On Mon, Jan 25, 2021 at 03:31:46AM -0800, Marcin Wojdyr wrote:

I would highly recommend not to allow swapping of these atoms - at least not by default: the resulting structure is different, i.e. the atom-name based conformational restraints will get broken. What you can do is allow for a 180-degree rotation for (a) the truly symmetrical side-chains (PHE, TYR),

I don't get it. How is the rotation different than swapping atoms in such case?

A rotation will preserve the atom-name specific environment, while a swap will not. Remember that at the point where you want to compare two occurences of a compound, you don't know if a change was due to a (very unlikely) atom-name mixup or because a program imposed a chi-angle convention (e.g. Coot when it wants to rotate Phe/Tyr residues) and therefore rotation.

I find a rotation more logical ... it would do what I as a user would do manually in a display program without any messing up.

Are there any restraints that differentiate between CG1 and CG2 in valine?

True ... because they are chemically equivalent and the VAL restraint dictionaries (being standard amino acids) have been "normalised". So we have e.g. for Engh&Huber:

GEOMETRY VAL ANGLE 109.0 3.0 CB CG1 HG11
GEOMETRY VAL ANGLE 109.0 3.0 CB CG1 HG12
GEOMETRY VAL ANGLE 109.0 3.0 CB CG1 HG13
GEOMETRY VAL ANGLE 109.0 3.0 CB CG2 HG21
GEOMETRY VAL ANGLE 109.0 3.0 CB CG2 HG22
GEOMETRY VAL ANGLE 109.0 3.0 CB CG2 HG23

But if you would generate the restraints afresh (e.g. with Grade) you get

VAL CB CG1 HG11 111.2 3.0 VAL CB CG1 HG12 112.5 3.0 VAL CB CG1 HG13 113.3 3.0

VAL CB CG2 HG21 112.9 3.0 VAL CB CG2 HG22 112.5 3.0 VAL CB CG2 HG23 111.5 3.0

I.e. very slightly different angles for the different hydrogens - and now they CG1/CG2 are not equivalent any more. This is what comes back from the CSD and it might not be very sensible (definitely not in this case - which is why we have curated Engh&Huber dictionaries) ... but for generic compounds that manual inspection or automatic curation might not happen at all.

I was also thinking more of rotamer libraries (as well as NCS-relations): those will use atom naming and a swap that is not a rotation could confuse them. In general I would always expect a VAL to look like this:

https://www.rcsb.org/ligand/VAL

and not having CG1/CG2 (plus hydrogens) swapped.

Have you had a look at BALL yet? Maybe it doesn't do exactly what you are looking for, but I really like it and have used it in multiple projects so far. Yes, documentation is a bit scattered, but it has some examples e.g. here: https://github.com/BALL-Project/ball/wiki/RMSD

I don't think I can connect the two libraries,

Probably not ...

but I could borrow some ideas, such as having a class similar to AtomBijection in BALL.

... sounds interesting, yes.

Cheers

Clemens

--

*--------------------------------------------------------------

wojdyr commented 3 years ago

Do you know an algorithm that can find all rotations for any small molecule?

CV-GPhL commented 3 years ago

Hi Marcin,

do you mean all internal (torsion) rotations? Wouldn't you need a dictionary for that?

Cheers

Clemens

On Mon, Jan 25, 2021 at 07:36:05AM -0800, Marcin Wojdyr wrote:

Do you know an algorithm that can find all rotations for any small molecule?

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/project-gemmi/gemmi/issues/43#issuecomment-766901353

--

*--------------------------------------------------------------

wojdyr commented 3 years ago

do you mean all internal (torsion) rotations? Wouldn't you need a dictionary for that?

I meant how to use rotations instead of atom swapping in general case?

chmnk commented 2 years ago
* for **subgraph isomorphism**: codes such as vf2/[vf3](https://github.com/MiviaLab/vf3lib) and [RI](https://github.com/InfOmics/RI)

Hi Conor and Marcin, since the issue is still open, would like to add that symmetry-adapted RMSD computation of small molecules is nicely done in RDKit (CalcRMS and GetBestRMS with alignment) based on the VF2 algorithm. Some benchmarking was also done a while ago (Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2) (maybe there are more recent papers on it).