j-wags closed this 3 years ago
I just tested the new threshold. For my dataset comprising 259 molecules and 1053 conformers, `validate` reduces the set to 267 molecules and 349 conformers. Of the 704 conformers that were moved to error_mols, 10 failed (unspecified stereochemistry) and the other 694 were removed because they are duplicates. That's a big change from before, when all of the latter conformers were included.
`generate-conformers` reduces the number of molecules to 265 (errors not specified) and increases the number of conformers to 1065. 79 molecules have only one conformer; the other 186 molecules have at least two conformers.
After looking at a couple of molecules with only one conformer (rigid aromatic molecules), I agree with the new thresholds.
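The counts reported in this comment can be cross-checked with a few lines of arithmetic. This is only a bookkeeping sketch: every number is taken from the comment itself, and the variable names are illustrative, not part of any tool.

```python
# Counts quoted in the comment above; variable names are illustrative only.
conformers_in = 1053
conformers_after_validate = 349
failed_stereo = 10   # unspecified stereochemistry
duplicates = 694     # removed as duplicate conformers

# Conformers moved to error_mols by the validate step.
removed = conformers_in - conformers_after_validate
assert failed_stereo + duplicates == removed

# After generate-conformers: molecules split by conformer count.
molecules_after_generate = 265
single_conformer = 79
multi_conformer = 186
assert single_conformer + multi_conformer == molecules_after_generate

duplicate_fraction = duplicates / conformers_in
print(f"removed by validate: {removed}")
print(f"fraction of input conformers removed as duplicates: {duplicate_fraction:.1%}")
```

The reported numbers are internally consistent: the two removal reasons account for every removed conformer, and the single- and multi-conformer groups add up to the molecule total.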
> For my dataset comprising 259 molecules and 1053 conformers, `validate` reduces the set to 267 molecules and 349 conformers
Hm, quick sanity check. Are the numbers 259 and 267 switched here?
That's a dramatic reduction in number of conformers, though I could imagine that PDB ligands would be biased toward the same conformers. Again as a sanity check, do the "deduplicated" conformers seem reasonable upon visual inspection?
The `error_mols/error_mol_X.txt` file should contain the reasoning behind considering a molecule to be an error, and reference which molecule it's a duplicate of (to help with visual inspection of "redundant" conformers).
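To make the idea of threshold-based deduplication with a "duplicate of" record concrete, here is a minimal sketch. It rests on assumptions the thread does not confirm: that conformers are compared by a coordinate RMSD cutoff, that a greedy first-kept-wins pass is used, and that no alignment or symmetry handling is needed. The names `rmsd` and `deduplicate` and the cutoff value are hypothetical and are not the tool's actual implementation.

```python
import math

def rmsd(a, b):
    """Plain coordinate RMSD between two conformers (no alignment, no symmetry)."""
    if len(a) != len(b):
        raise ValueError("conformers must have the same number of atoms")
    squared = sum((p - q) ** 2 for xa, xb in zip(a, b) for p, q in zip(xa, xb))
    return math.sqrt(squared / len(a))

def deduplicate(conformers, threshold):
    """Greedy pass: keep a conformer unless it is within `threshold` of a kept one.

    Returns (kept, duplicate_of), where duplicate_of maps each removed index
    to the index of the kept conformer it was judged to duplicate.
    """
    kept = []
    duplicate_of = {}
    for i, conf in enumerate(conformers):
        match = next((k for k in kept if rmsd(conf, conformers[k]) < threshold), None)
        if match is None:
            kept.append(i)
        else:
            duplicate_of[i] = match
    return kept, duplicate_of

# Toy two-atom "conformers"; coordinates and cutoff are made up for illustration.
conformers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.05, 0.0, 0.0)],  # nearly identical to conformer 0
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],   # clearly different
]
kept, duplicate_of = deduplicate(conformers, threshold=0.1)
print("kept:", kept)
print("duplicate_of:", duplicate_of)
```

Recording which kept conformer each removed one matched is what makes side-by-side visual inspection of a "redundant" pair straightforward.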
> Hm, quick sanity check. Are the numbers 259 and 267 switched here?
No, this is correct. Some of the 259 original molecules were split up into different molecular entities. I guess due to different enantiomers or bond orders.
> That's a dramatic reduction in number of conformers, though I could imagine that PDB ligands would be biased toward the same conformers. Again as a sanity check, do the "deduplicated" conformers seem reasonable upon visual inspection?
It's reasonable, but not really obvious. Attached is one example of a set of conformers. The orange and the silver conformers are part of the set; the other four are error mols.
Here are the SDFs: JAN-00000-00.zip
> Some of the 259 original molecules were split up into different molecular entities. I guess due to different enantiomers or bond orders.
That makes sense. Probably stereoisomers of pyramidal nitrogens, if I had to guess. It'll be a nice day when we can remove this behavior from the toolkit.
Thanks for trying this out, @dfhahn. Merging!