Failed to write merged molecule PDB

JenkeScheen commented 3 years ago

I'm trying to visualise a merged molecule (i.e. related to https://github.com/michellab/BioSimSpace/issues/209) with RDKit. For this, I need to write a merged molecule to file, then load with RDKit. There is an example available in the playground that shows how to save the lambda 0 state, however this seems to be broken at the moment.

Using the input ligands from https://github.com/michellab/BioSimSpace/issues/209:

m0_a = BSS.Align.rmsdAlign(m0, m1, mapping)

merged = BSS.Align.merge(m0_a, m1, mapping,
                    allow_ring_breaking=True,
                    allow_ring_size_change=True
                    )

BSS.IO.saveMolecules("tmp_images/m0.pdb", merged, "PDB", 
                     { "coordinates" : "coordinates0", "element": "element0" })

triggers an OSError: Failed to save system to format: 'PDB'.

Writing to MOL2 works, however 1) this can't be loaded by RDKit and 2) the written file doesn't seem quite right as atom types are all set to Du:

@<TRIPOS>MOLECULE
Merged_Molecule
   35     0     1     0     0
SMALL
GASTEIGER

@<TRIPOS>ATOM
      1 C            -1.9994   21.5293  -27.3164 Du       1 LIG        0.0000
      2 C            -1.3164   22.3983  -28.1672 Du       1 LIG        0.0000
      3 C            -1.9694   22.9481  -29.2627 Du       1 LIG        0.0000
      4 C            -3.3048   22.6191  -29.5374 Du       1 LIG        0.0000
      5 C            -3.9881   21.7500  -28.6666 Du       1 LIG        0.0000
      6 C            -3.3251   21.2102  -27.5709 Du       1 LIG        0.0000
      7 CL           -5.6530   21.3409  -28.9675 Du       1 LIG        0.0000
      8 C            -4.0265   23.1596  -30.7141 Du       1 LIG        0.0000
      9 O            -4.2001   22.4387  -31.6840 Du       1 LIG        0.0000
     10 N            -4.4988   24.4185  -30.6496 Du       1 LIG        0.0000
     11 C            -5.2646   25.1266  -31.5853 Du       1 LIG        0.0000
     12 C            -5.9161   24.5765  -32.7013 Du       1 LIG        0.0000
     13 C            -6.7243   25.3931  -33.4867 Du       1 LIG        0.0000
     14 N            -6.8703   26.6802  -33.1966 Du       1 LIG        0.0000
     15 C            -6.2676   27.2510  -32.1612 Du       1 LIG        0.0000
     16 C            -5.4905   26.4737  -31.3060 Du       1 LIG        0.0000
     17 N            -6.5137   28.5980  -31.8722 Du       1 LIG        0.0000
     18 C            -5.6953   29.4626  -31.2006 Du       1 LIG        0.0000
     19 O            -4.6118   29.1297  -30.7662 Du       1 LIG        0.0000
     20 C            -6.2201   30.8601  -30.9956 Du       1 LIG        0.0000
     21 CL           -1.1040   24.0589  -30.2782 Du       1 LIG        0.0000
     22 H            -7.2980   30.9502  -31.1421 Du       1 LIG        0.0000
     23 H            -1.4927   21.1017  -26.4622 Du       1 LIG        0.0000
     24 H            -0.2898   22.6477  -27.9682 Du       1 LIG        0.0000
     25 H            -3.8551   20.5333  -26.9159 Du       1 LIG        0.0000
     26 H            -4.3226   24.9809  -29.8312 Du       1 LIG        0.0000
     27 H            -5.7911   23.5289  -32.9394 Du       1 LIG        0.0000
     28 H            -7.2402   24.9806  -34.3391 Du       1 LIG        0.0000
     29 H            -5.0553   26.9157  -30.4146 Du       1 LIG        0.0000
     30 H            -7.3692   29.0300  -32.1819 Du       1 LIG        0.0000
     31 H            -5.6998   31.5372  -31.6708 Du       1 LIG        0.0000
     32 H            -5.9665   31.1908  -29.9981 Du       1 LIG        0.0000
     33 H            -5.4600   32.7700  -30.5400 Du       1 LIG        0.0000
     34 H            -4.6100   31.4100  -29.7700 Du       1 LIG        0.0000
     35 H            -4.3600   31.7500  -31.4900 Du       1 LIG        0.0000
@<TRIPOS>SUBSTRUCTURE
     1 LIG         1  RESIDUE

lohedges commented 3 years ago

Just do this:

BSS.IO.saveMolecules("tmp_images/m0.pdb", merged._toRegularMolecule(), "PDB")

Or if you want the lambda=1 state:

BSS.IO.saveMolecules("tmp_images/m1.pdb", merged._toRegularMolecule(is_lambda1=True), "PDB")

In your example it is likely failing because you would need to provide a map for all properties that are required by the PDB parser, not just coordinates and element. That said, I'm not sure why Mol2 would work. I'll test locally to see if I can see why. Also, if you set

BSS.setVerbose(True)

you'll see the full error trace from Sire, which should tell you more info as to why the parser failed.

(For reference, I didn't write prepareFEP.py, and wouldn't have chosen to save the lambda=0 state like that.)

lohedges commented 3 years ago

It looks like the property map isn't working correctly for PDB for some reason. Not sure why this issue wouldn't have been triggered before. However, using ._toRegularMolecule is the correct way to do things.

lohedges commented 3 years ago

Okay, fixed it. The issue was that I had updated the way we write PDB files within BioSimSpace to inject CONECT records. To do this, I use Sire.Mol.Connectivity. However, I hadn't passed through the property map from saveMolecules, so it couldn't extract the molecular coordinates in order to work out the connectivity matrix. I'll just check that my fix doesn't break anything, then push.

jmichel80 commented 3 years ago

Great ! Definitely happy to improve prepareFEP.py, we have to be careful not to break the script accidentally as it is used in Cresset's pipelines.

JenkeScheen commented 3 years ago

great, thanks for the quick fix!

lohedges commented 3 years ago

Annoyingly this will need to be updated at the Sire level. Despite the Connectivity object taking a PropertyMap as an argument, it doesn't actually use it, so the same error is triggered. Should be a simple fix, but will need a rebuild of Sire too.

lohedges commented 3 years ago

Okay, I've fixed Sire. Bizarrely, despite taking a PropertyMap argument, the BondHunter classes were using another property object to get the names for the coordinates and element properties, which were hard-corded to these values, giving no user flexibility.

Once Sire is built, I'll push the BioSimSpace update. Following that, the existing prepareFEP.py script should work again. (I still note that we should eventually change it to use the ._toRegualrMolecule() functionality instead.)

JenkeScheen commented 3 years ago

just reopening this for now as I'm seeing some odd behaviours when trying to visualise the merged molecule with RDKit. It seems that with the merged.toMolecule() command there are some formatting issues (at least, RDKit doesn't like the formatting..).

With the above setup of merged and BSS.IO.saveMolecules("tmp_images/m0.pdb", merged._toRegularMolecule(), "PDB"), tmp_images/m0.pdb looks like:

MODEL     1
ATOM      1  C   LIG     1      -1.999  21.529 -27.316  1.00  0.00           C
ATOM      2  C   LIG     1      -1.316  22.398 -28.167  1.00  0.00           C
ATOM      3  C   LIG     1      -1.969  22.948 -29.263  1.00  0.00           C
ATOM      4  C   LIG     1      -3.305  22.619 -29.537  1.00  0.00           C
ATOM      5  C   LIG     1      -3.988  21.750 -28.667  1.00  0.00           C
ATOM      6  C   LIG     1      -3.325  21.210 -27.571  1.00  0.00           C
ATOM      7  CL  LIG     1      -5.653  21.341 -28.967  1.00  0.00          CL
ATOM      8  C   LIG     1      -4.026  23.160 -30.714  1.00  0.00           C
ATOM      9  O   LIG     1      -4.200  22.439 -31.684  1.00  0.00           O
ATOM     10  N   LIG     1      -4.499  24.418 -30.650  1.00  0.00           N
ATOM     11  C   LIG     1      -5.265  25.127 -31.585  1.00  0.00           C
ATOM     12  C   LIG     1      -5.916  24.576 -32.701  1.00  0.00           C
ATOM     13  C   LIG     1      -6.724  25.393 -33.487  1.00  0.00           C
ATOM     14  N   LIG     1      -6.870  26.680 -33.197  1.00  0.00           N
ATOM     15  C   LIG     1      -6.268  27.251 -32.161  1.00  0.00           C
ATOM     16  C   LIG     1      -5.490  26.474 -31.306  1.00  0.00           C
ATOM     17  N   LIG     1      -6.514  28.598 -31.872  1.00  0.00           N
ATOM     18  C   LIG     1      -5.695  29.463 -31.201  1.00  0.00           C
ATOM     19  O   LIG     1      -4.612  29.130 -30.766  1.00  0.00           O
ATOM     20  C   LIG     1      -6.220  30.860 -30.996  1.00  0.00           C
ATOM     21  CL  LIG     1      -1.104  24.059 -30.278  1.00  0.00          CL
ATOM     22  H   LIG     1      -7.298  30.950 -31.142  1.00  0.00           H
ATOM     23  H   LIG     1      -1.493  21.102 -26.462  1.00  0.00           H
ATOM     24  H   LIG     1      -0.290  22.648 -27.968  1.00  0.00           H
ATOM     25  H   LIG     1      -3.855  20.533 -26.916  1.00  0.00           H
ATOM     26  H   LIG     1      -4.323  24.981 -29.831  1.00  0.00           H
ATOM     27  H   LIG     1      -5.791  23.529 -32.939  1.00  0.00           H
ATOM     28  H   LIG     1      -7.240  24.981 -34.339  1.00  0.00           H
ATOM     29  H   LIG     1      -5.055  26.916 -30.415  1.00  0.00           H
ATOM     30  H   LIG     1      -7.369  29.030 -32.182  1.00  0.00           H
ATOM     31  H   LIG     1      -5.700  31.537 -31.671  1.00  0.00           H
ATOM     32  H   LIG     1      -5.966  31.191 -29.998  1.00  0.00           H
ATOM     33  H   LIG     1      -5.460  32.770 -30.540  1.00  0.00          XX
ATOM     34  H   LIG     1      -4.610  31.410 -29.770  1.00  0.00          XX
ATOM     35  H   LIG     1      -4.360  31.750 -31.490  1.00  0.00          XX
CONECT    1    2    6   23
CONECT    2    3   24
CONECT    3    4   21
CONECT    4    5    8
CONECT    5    6    7
CONECT    6   25
CONECT    8    9   10
CONECT   10   11   26
CONECT   11   12   16
CONECT   12   13   27
CONECT   13   14   28
CONECT   14   15
CONECT   15   16   17
CONECT   16   29
CONECT   17   18   30
CONECT   18   19   20
CONECT   20   22   31   32
ENDMDL
END

1) The XX notation for dummy atoms breaks the RDKit PDB parser. Here, running both Chem.MolFromPDBFile("tmp_images/m0.pdb", sanitize=True) and Chem.MolFromPDBFile("tmp_images/m0.pdb", sanitize=False) return None. When I substitute XX with At, only sanitize=False returns a molecule: Generating a 2D depiction doesn't work that well though: which brings me to: 2) no CONECT records are being written out for the dummy atoms, which means that in this RDKit is guessing bonds (and clearly not doing a good job). For instance using obabel to convert to SDF: obabel -ipdb tmp_images/m0.pdb -O tmp_images/m0.sdf, we can see the disconnected dummy atoms more clearly:

####################

I think issue 1 should probably be resolved by the user, i.e. substitute XX with At (as XX is intended behaviour, presumably), but writing the CONECTS in issue 2 would help a lot towards depicting a merged molecule.

Okay, fixed it. The issue was that I had updated the way we write PDB files within BioSimSpace to inject CONECT records. To do this, I use Sire.Mol.Connectivity. However, I hadn't passed through the property map from saveMolecules, so it couldn't extract the molecular coordinates in order to work out the connectivity matrix. I'll just check that my fix doesn't break anything, then push.

I've not had the time to check whether this actually fixes issue 2, would you expect it to?

lohedges commented 3 years ago

No, this won't fix issue 2, since I assume that the Sire connectivity code is ignoring dummy atoms. It uses the element information to work out whether atoms are bonded, and it won't know what element to use for a dummy. (It uses a "bond-hunter" rather than the bonding info from the potential.) I might be able to do some hack where I temporarily replace the dummy atoms with their actual elements, i.e. what they'd be in the other state.

For the first issue, I'm actually not sure what the element should be for a dummy in a PDB file? If there is a standard, then let me know and I can update the parser. It's just writing a string representation of the Sire.Element object (with some extra stuff for formal charge, if present) so I can just test for dummy and write something else if needed.

lohedges commented 3 years ago

It looks like RDKit handles * for dummy.

JenkeScheen commented 3 years ago

yes, they do use * for dummies but I'm not sure what happens if you have a *-* bond, because they use dummies for other purposes (in fragmentisation to denote attachment points, mainly)

JenkeScheen commented 3 years ago

FWIW, substituting carbons in a PDB file with * will make the RDkit PDB file reader return None (both with and without sanitization) without jumping through some hoops first.

For an input PDB file with dummies denoted with " *" (note the space):

COMPND    lig_ejm_42 
AUTHOR    GENERATED BY OPEN BABEL 2.3.2
HETATM    1  C   LIG     1      -1.960  21.550 -27.330  1.00  0.00           C  
HETATM    2  C   LIG     1      -1.290  22.410 -28.200  1.00  0.00           C  
HETATM    3  C   LIG     1      -1.960  22.950 -29.290  1.00  0.00           C  
HETATM    4  C   LIG     1      -3.300  22.620 -29.540  1.00  0.00           C  
HETATM    5  C   LIG     1      -3.970  21.760 -28.650  1.00  0.00           C  
HETATM    6  C   LIG     1      -3.290  21.230 -27.560  1.00  0.00           C  
HETATM    7 CL   LIG     1      -5.640  21.350 -28.920  1.00  0.00          Cl  
HETATM    8  C   LIG     1      -4.040  23.150 -30.710  1.00  0.00           C  
HETATM    9  O   LIG     1      -4.230  22.420 -31.670  1.00  0.00           O  
HETATM   10  N   LIG     1      -4.510  24.410 -30.650  1.00  0.00           N  
HETATM   11  C   LIG     1      -5.290  25.110 -31.580  1.00  0.00           C  
HETATM   12  C   LIG     1      -5.960  24.550 -32.680  1.00  0.00           C  
HETATM   13  C   LIG     1      -6.780  25.360 -33.460  1.00  0.00           C  
HETATM   14  N   LIG     1      -6.920  26.650 -33.180  1.00  0.00           N  
HETATM   15  C   LIG     1      -6.300  27.230 -32.160  1.00  0.00           C  
HETATM   16  C   LIG     1      -5.510  26.460 -31.310  1.00  0.00           C  
HETATM   17  N   LIG     1      -6.540  28.580 -31.880  1.00  0.00           N  
HETATM   18  C   LIG     1      -5.710  29.450 -31.230  1.00  0.00           C  
HETATM   19  O   LIG     1      -4.620  29.120 -30.810  1.00  0.00           O  
HETATM   20  C   LIG     1      -6.230  30.850 -31.030  1.00  0.00           C  
HETATM   21 CL   LIG     1      -1.110  24.050 -30.330  1.00  0.00          Cl  
HETATM   22  H   LIG     1      -6.760  31.250 -31.890  1.00  0.00           H  
HETATM   23  H   LIG     1      -1.440  21.130 -26.480  1.00  0.00           H  
HETATM   24  H   LIG     1      -0.260  22.660 -28.020  1.00  0.00           H  
HETATM   25  H   LIG     1      -3.810  20.560 -26.890  1.00  0.00           H  
HETATM   26  H   LIG     1      -4.320  24.980 -29.840  1.00  0.00           H  
HETATM   27  H   LIG     1      -5.840  23.500 -32.910  1.00  0.00           H  
HETATM   28  H   LIG     1      -7.310  24.940 -34.300  1.00  0.00           H  
HETATM   29  H   LIG     1      -5.060  26.910 -30.430  1.00  0.00           H  
HETATM   30  H   LIG     1      -7.400  29.010 -32.180  1.00  0.00           H  
HETATM   31  *   LIG     1      -5.090  31.750 -30.680  1.00  0.00           *  
HETATM   32  H   LIG     1      -6.890  30.870 -30.160  1.00  0.00           H  
HETATM   33  *   LIG     1      -5.460  32.770 -30.540  1.00  0.00           *  
HETATM   34  *   LIG     1      -4.610  31.410 -29.770  1.00  0.00           *  
HETATM   35  *   LIG     1      -4.360  31.750 -31.490  1.00  0.00           *  
CONECT    1    2    6   23                                            
CONECT    2    3   24    1                                            
CONECT    3   21    4    2                                            
CONECT    4    8    3    5                                            
CONECT    5    4    7    6                                            
CONECT    6    5    1   25                                            
CONECT    7    5                                                      
CONECT    8    9   10    4                                            
CONECT    9    8                                                      
CONECT   10   11    8   26                                            
CONECT   11   12   16   10                                            
CONECT   12   13   27   11                                            
CONECT   13   28   14   12                                            
CONECT   14   13   15                                                 
CONECT   15   14   17   16                                            
CONECT   16   15   11   29                                            
CONECT   17   30   15   18                                            
CONECT   18   17   20   19                                            
CONECT   19   18                                                      
CONECT   20   22   18   31   32                                       
CONECT   20                                                           
CONECT   21    3                                                      
CONECT   22   20                                                      
CONECT   23    1                                                      
CONECT   24    2                                                      
CONECT   25    6                                                      
CONECT   26   10                                                      
CONECT   27   12                                                      
CONECT   28   13                                                      
CONECT   29   16                                                      
CONECT   30   17                                                      
CONECT   31   35   20   33   34                                       
CONECT   31                                                           
CONECT   32   20                                                      
CONECT   33   31                                                      
CONECT   34   31                                                      
CONECT   35   31                                                      
MASTER        0    0    0    0    0    0    0    0   35    0   35    0
END

Following paolo's gist we can visualise dummy atoms with the following lines:

So, it looks like this should work; might even be worth implementing in BSS.Align()? We just need to figure out a way to write CONECT records for the dummy atoms.

lohedges commented 3 years ago

Okay, that's great. Writing CONECT records will be easy. As mentioned, I'll just copy the element from the opposite state when computing the connectivity (creating a new molecule object). I'll do this later today if I get a chance. Could we just stick with Xx for dummies and perform a replacement on that instead? That way I won't need to modify the PDB parser.

JenkeScheen commented 3 years ago

yes! Just tested, that works too. Of course the parser has to be changed to:

mol_from_pdb = Chem.MolFromPDBBlock(lig_42_pdb_block.replace("XX", "AT"),
                                    sanitize=False, proximityBonding=False)

If we decide to go for that.

lohedges commented 3 years ago

Annoyingly using the element at the opposite end state doesn't work, since the bond length is non-standard so it doesn't get picked up. I'm going to need to manually create the connect record from the bond terms in the potential instead.

lohedges commented 3 years ago

Okay, that works. Let me know how you get on.

JenkeScheen commented 3 years ago

thanks for the implement - it seems that there might be something off with how the dummies are written, still:

MODEL     1
ATOM      1  C   LIG     1      -1.999  21.529 -27.316  1.00  0.00           C
ATOM      2  C   LIG     1      -1.316  22.398 -28.167  1.00  0.00           C
ATOM      3  C   LIG     1      -1.969  22.948 -29.263  1.00  0.00           C
ATOM      4  C   LIG     1      -3.305  22.619 -29.537  1.00  0.00           C
ATOM      5  C   LIG     1      -3.988  21.750 -28.667  1.00  0.00           C
ATOM      6  C   LIG     1      -3.325  21.210 -27.571  1.00  0.00           C
ATOM      7  CL  LIG     1      -5.653  21.341 -28.967  1.00  0.00          CL
ATOM      8  C   LIG     1      -4.026  23.160 -30.714  1.00  0.00           C
ATOM      9  O   LIG     1      -4.200  22.439 -31.684  1.00  0.00           O
ATOM     10  N   LIG     1      -4.499  24.418 -30.650  1.00  0.00           N
ATOM     11  C   LIG     1      -5.265  25.127 -31.585  1.00  0.00           C
ATOM     12  C   LIG     1      -5.916  24.576 -32.701  1.00  0.00           C
ATOM     13  C   LIG     1      -6.724  25.393 -33.487  1.00  0.00           C
ATOM     14  N   LIG     1      -6.870  26.680 -33.197  1.00  0.00           N
ATOM     15  C   LIG     1      -6.268  27.251 -32.161  1.00  0.00           C
ATOM     16  C   LIG     1      -5.490  26.474 -31.306  1.00  0.00           C
ATOM     17  N   LIG     1      -6.514  28.598 -31.872  1.00  0.00           N
ATOM     18  C   LIG     1      -5.695  29.463 -31.201  1.00  0.00           C
ATOM     19  O   LIG     1      -4.612  29.130 -30.766  1.00  0.00           O
ATOM     20  C   LIG     1      -6.220  30.860 -30.996  1.00  0.00           C
ATOM     21  CL  LIG     1      -1.104  24.059 -30.278  1.00  0.00          CL
ATOM     22  H   LIG     1      -7.298  30.950 -31.142  1.00  0.00           H
ATOM     23  H   LIG     1      -1.493  21.102 -26.462  1.00  0.00           H
ATOM     24  H   LIG     1      -0.290  22.648 -27.968  1.00  0.00           H
ATOM     25  H   LIG     1      -3.855  20.533 -26.916  1.00  0.00           H
ATOM     26  H   LIG     1      -4.323  24.981 -29.831  1.00  0.00           H
ATOM     27  H   LIG     1      -5.791  23.529 -32.939  1.00  0.00           H
ATOM     28  H   LIG     1      -7.240  24.981 -34.339  1.00  0.00           H
ATOM     29  H   LIG     1      -5.055  26.916 -30.415  1.00  0.00           H
ATOM     30  H   LIG     1      -7.369  29.030 -32.182  1.00  0.00           H
ATOM     31  H   LIG     1      -5.700  31.537 -31.671  1.00  0.00           H
ATOM     32  H   LIG     1      -5.966  31.191 -29.998  1.00  0.00           H
ATOM     33  H   LIG     1      -5.460  32.770 -30.540  1.00  0.00          XX
ATOM     34  H   LIG     1      -4.610  31.410 -29.770  1.00  0.00          XX
ATOM     35  H   LIG     1      -4.360  31.750 -31.490  1.00  0.00          XX
CONECT    1    2    6   23
CONECT    2    3   24
CONECT    3    4   21
CONECT    4    5    8
CONECT    5    6    7
CONECT    6   25
CONECT    8    9   10
CONECT   10   11   26
CONECT   11   12   16
CONECT   12   13   27
CONECT   13   14   28
CONECT   14   15
CONECT   15   16   17
CONECT   16   29
CONECT   17   18   30
CONECT   18   19   20
CONECT   20   22   31   32
CONECT   31   33   34   35
ENDMDL
END

or perhaps this is one of the cases where there is an indexing mismatch between RDKit and BSS?

lohedges commented 3 years ago

Sorry, what's the issue here, just the way that RDKit is placing the dummies in the visualisation? The CONECT looks okay to me, in at least everything is in ascending numerical order with no duplicates (which few programs actually do). Is RDKit just ignoring the CONECT for the AT atoms? What happens if you turn proximityBonding on?

JenkeScheen commented 3 years ago

My bad, I'd gotten confused into thinking the dummy atoms should be attached to C:19, not C:30. So yes, this works fine although RDKit will of course complain about the hydrogen valencies. Behaviour is as expected:

To answer your question, setting proximityBonding to True lets RDKit come up with loads of bonds (AT is rather large):

Closing this issue for now. I'll reopen it if I run into a similar use-case again, then I might PR a viz implementation.

lohedges commented 3 years ago

Phew, glad it's working. For the life of me I couldn't work out what was wrong and was worried that I'd completely misinterpreted your problem.

Yes, a PR would be much welcome. Something along the lines of drawMapping would be good. The function could take a merged molecule (checking that it is merged with .isMerged()). The user could also pass is_lambda1=True to visualise the lambda=1 state instead. (This approach is used elsewhere in the code.)

michellab / BioSimSpace

Failed to write merged molecule PDB #223