molinfo-vienna / CDPKit

The Chemical Data Processing Toolkit
https://cdpkit.org
GNU Lesser General Public License v2.1
65 stars 9 forks source link

Bad conformers generated for several cases in Astex and Posebuster virtual screening datasets #13

Closed Hong-Rui closed 9 months ago

Hong-Rui commented 9 months ago

Hi Thomas, Recently we have tested the CDPKit confgen for the astex and posebuster datasets. We curated the crystal structures for each of the ligands, apply some Glide-style fragmentation for the ligands, and then use confgen to regenerate conformations for these fragments.

confgen_problems.zip

Here I attached several cases where we find that confgen failed to recover the crystal conformer of ligands. We categorize these cases into 4 classes:

  1. The confgen will generate some cis form amide conformer, which we think should not happen and not found in the crystal structures.

  2. We found some cases with double bond stereo problems as we mentioned in the last issue. I input the confgen with a SMILES converted from the crystal ligand and get wrong double bond stereos as returned. In the attached files we give two of this kind of cvases.

  3. For some cases there are saturated hetero rings in ligands. We give 2 cases where none of the generated conformations recover the correct ring parkering state with crystal conformer.

  4. And the most serious issue now. For lots of cases there are still lacks of torsion rules to recover crystal conformers for some very specific torsion environments. None of the generated conformations matched the correct dihedral angle values for that specific torsions, and we cannot reproduce similar 3D orientation after regenerating conformations. For this part we give 12 cases. And some of them are easy, while several of them are really much more difficult since they are just huge and with > 10 torsions. For these huge cases, maybe it's better to try to sample all the torsion angles efficiently rather than adding torsion ranges in the library one by one.....

Similar to last time, for all the cases, the PDB_ID_ligand.sdf represent the crystal structure of the ligands, and the aligned_root_conf.sdf represents conformers re-generated from the confgen.

Thanks for the help, and hopes these can be fixed soon!

seidelt commented 9 months ago

Hi Hong-Rui.

thank you for testing and reporting problematic cases!

  1. The cis form is perfectly valid (obviously it is fine from an energetic point of view, otherwise it wouldn't have ended up in the output ensemble)! Of course most of the time you will see trans but not trying also the cis form might be bad for some molecules. I have seen the craziest amid bond conformations in crystal structures!

  2. Plases send me the SMILES string you were using as input. I used the PDB ligand structures as input and the generated double bond stereochemistry was correct for both molecules!

3 and 4. First some general notes: The output ensemble you get is usually (especially with very flexible molecules) only a subset of a much larger set of generated conformers. Furthermore, during the RMSD-based conformer picking process you loose many low-energy conformers that might perfectly fit the bioactive conformation. It is to some extent a matter of luck that the best matching conformer ends up in the final output ensemble. If you do not want to rely on luck you need to store all conformers conforge generates. This can be done by providing 0 as argument the options -n and -r (you can also enlarge the energy window a bit, e.g. -e 20).

I have attached two *.csv files with the calculated best RMSDs for the ligands that you have provided. One file shows the RMSDs obtained for the conformer ensembles generated with default setting and the second file shows the results obtained for the max. large ensembles (-n 0 -r 0 -e 20). As expected, the unconstrained ensembles for most ligands lead to lower RMSDs. However, they are not that much better given the fact that these ensembles comprise thousands of conformers.

Regarding point 3: The best matching conformer (generated with def. setting) for 1SJ0 indeed does not have a chair conformation. However, the RMSD is 0.67 which is not bad. The best matching conformer from full ensemble features the desired chair conformation but the RMSD only slightly improves to 0.66. 6YJA contains a small macrocycle and the best RMSD with default settings (100 confs.) is 0.38. The RMSD does not improve with more conformers (full ensemble 690 confs.). IMHO one cannot complain about this quite good results!

Regarding point 4: One cannot expect that for all molecules out there a conformer coming perfectly close to the bioactive conformation will be generated! No conformer generator exists that can do that in a reasonable amount of time without knowledge of the binding site structure. Given that your molecules are terribly flexible (> 15 rot bonds) the obtained best RMSDs are still quite good (max. 3.92 using default settings and max. 2.85 with full ensembles)! If you want better results for such structures you have to switch to more time consuming conformer sampling methods. Here I can't do anything for you!

rmsds_def.csv rmsds_n0_r0_e20.csv

Hong-Rui commented 9 months ago

Hi Thomas,

For class 1, in our opinion, the conformer generator should always give only trans amide. We've also seen some crazy crystal structures with cis form amide, and we don't believe that as a correct crystal structure. In nearly all drug discovery scenarios, we won't use a cis form amide as a valid molecule during all virtual screening process. If it is possible, we want a feature that we can have a option to only output conformer with trans amide, which is also a feature of RDKIt EDKTG conf generator.
embedding_parameters.forceTransAmides = True

  1. For 1Q41, the SMILES from crystal ligand is [H]O/N=C1C(=C2/C(=O)N([H])c3c([H])c([H])c([H])c([H])c32)/N([H])c2c([H])c([H])c([H])c([H])c2/1 and after running confgen with confgen -i test.smi -o test.sdf -t 32 -n 1000 -r 0.1 -A -C LARGE_SET_DIVERSE -v DEBUG I get back the SMILES, converted from the conformer, [H]O/N=C1C(=C2/C(=O)N([H])c3c([H])c([H])c([H])c([H])c32)\N([H])c2c([H])c([H])c([H])c([H])c2\1

The crystal structure:

1702470370022

generated conformer:

1702470432818

For 7ZXV, the SMILES from crystal structure is [H]/C(C1=C(C([H])([H])[H])C(=O)C([H])([H])C([H])([H])C1(C([H])([H])[H])C([H])([H])[H])=C([H])\C(=C([H])\C([H])=C([H])\C(=C([H])\C([H])=C([H])\C([H])=C(C(\[H])=C([H])\C([H])=C(C(\[H])=C(/[H])C1=C(C([H])([H])[H])C(=O)C([H])([H])C([H])([H])C1(C([H])([H])[H])C([H])([H])[H])/C([H])([H])[H])/C([H])([H])[H])C([H])([H])[H])C([H])([H])[H]

and after confgen, the SMILES for generated conformer is [H]/C(C1=C(C([H])([H])[H])C(=O)C([H])([H])C([H])([H])C1(C([H])([H])[H])C([H])([H])[H])=C([H])/C(=C([H])/C([H])=C([H])\C(=C([H])\C([H])=C([H])\C([H])=C(C(\[H])=C([H])\C([H])=C(C(\[H])=C(/[H])C1=C(C([H])([H])[H])C(=O)C([H])([H])C([H])([H])C1(C([H])([H])[H])C([H])([H])[H])/C([H])([H])[H])/C([H])([H])[H])C([H])([H])[H])C([H])([H])[H]

The crystal structure:

1702470921193

The generated conformer:

1702470989059

The confgen generated 14 conformer, and none of them match the correct trans form.

  1. For 1SJ0, I think the main issue lies more in the torsion rules. However, we do want a balanced distribution of the conforemrs generated to have equal number of chairs and boat conformations for saturated heteroatom rings, so that we can guarantee the conformer diversity. (except for some imposible cases)

For 6YJA, the crystal structure is

1702471536872

and none of our generated fragment conformer matched this macrocycle conformation: some example: (and as you can see from my previous attached folder)

1702471645986 1702471673664

For some algorithmic reason, I usually need to pass a SMILES to confgen and let it to embed the conformation.

  1. I think for molecules with > 15 rotatable bonds, it better that we left this kind of torsion sampling to the docking itself. However, for some smaller molecules, such as 1G9V, 1SJ0, 1YVF, 1Z95 and 2BSM, I would expect that confgen would at least output a few conformations that match the torsion state that the crystal structure presents. In my calculation, I use SMILES as the input to confgen, and for these cases I sent you, none of them have best RMSD smaller than 1.0.

I think you can just get the SMILES converted from the 3D SDF, in RDKit it's just like:

1702472151284

As an example for bad torsions: this is the crystal structure for 1SJ0:

1702472223227

and all of the generated conformers are like:

1702472284583

The bond connected between the fused ring and the aromatic benzene should be rotated so that it can match the crystal conf, and none of them get matched. I think just check this torsion rule in the torsion library, and maybe adding one vaue ranges could solve this kind of issues.

And that's why I give you so many cases, by adding this rules one by one, the torsion library should reproduce most of the crystal structures, at least for cases with # of rot bonds < 10.

I'm developing a docking protocol that involves fragment rigid docking, and so I need to rely on some powerful conformation generator that can reproduce crystal structure of fragments to some degree. And I think this is doable, as this is what torsion library, collected from crystal ligand distributions, is trying to do exactly.

seidelt commented 9 months ago

Hi Hong-Rui,

Conforge now generates the correct double bond geometries for the problematic SMILES strings. The observed problems originated from the SMILES parser which has been fixed now (master branch). Thanks for pointing me to this problem, it was there for more than 15 years and has, up to now, not been recognized!

Regarding trans-form only amide rotamers: The cis form is not wrong, I disagree here. I have seen all sorts of amide bond torsions also in high quality X-ray structure datasets.

For people being unhappy with the default torsion angles Conforge provides the option -k which allows to specify a custom torsion library whose entries override any matching rules in the default library.

Just save the following xml-snippet as a file and provide the path to this file as argument to the -k option:

<library name="CustomTorsionLibrary">
  <category name="GG" atomType1="*" atomType2="*">
    <rule pattern="[O:1]~[CX3:2]-[NX3H1:3]~[*:4]">
      <torsions>
        <angle value="0.0" tolerance1="0.0" tolerance2="0.0" score="0.0"/>
      </torsions>
    </rule>
  </category>
</library>

Regarding point 3: In order to not lose any generated low-energy conformers I recommend to turn off the diversity picking procedure (-r 0.0, see my last posting) and set the max. output ensemble size to 250 conformers, at least!

I have added some missing torsion angles (for 1SJ0) and I don't know exactly if this also affected the results obtained for 6YJA (but I don't think so). The output ensemble for 6YJA (generated with all default settings!) does contain conformers that show the desired ring conformation:

6YJA_confs

However, what you are trying to do (if I understood it correctly) is highly error-prone because it can't be expected that ring systems will always prefer the same conformation for different types of substituents. Differences in steric demand and electronic properties likely result in changes regarding the energetic ranking of the possible ring conformers. Not encountering your favorite ring conformer does not mean Conforge generates bad conformers. Exchanging substituents results in different molecules with distinct conformational spaces. After all, the goal of Conforge is to generate diverse low-energy conformer ensembles of WHOLE molecules and not to generate ensembles that feature all sorts of theoretically possible ring conformers.

Hong-Rui commented 9 months ago

Hi Thomas, Thanks for your fixes on these issues!

I've tested the new code and now the bond stereo works fine. The addition of torsion rules now also fixed several problematic molecules. Now we've implemented our own torsion drives by using the new torsion library 3.0 from the literature. https://pubs.acs.org/doi/10.1021/acs.jcim.2c00043

We implemented our own deep first search hierarchy torsion matching algorithm and rotatable bond definitions. By adding torsion rules one by one, lots of molecules reproduce crystal conformations.

Maybe some day the CDPKit should also update the torsion library to the latest version?

Anyway, thanks for our help!

seidelt commented 9 months ago

Hi Hong-Rui,

great, glad to hear that the bond stereo issue is gone now. Again, many thanks for identifying and reporting bugs!

I will have a look at the new torsion library release and see how it performs. Unfortunately, the V2.0 library in many cases did not provide all preferred angles for torsion fragments with topological symmetry (where multiple mappings of the SMARTS-pattern are possible). Eventually, I had to do my own torsion angle distribution analysis and added missing angles manually which was a lot of work in the end. Hope they fixed these issues in the V3.0 version.

I will then close this issue.... Just create a new one if you encounter any additional bugs or problems!