isayev / ANI1_dataset

A data set of 20 million calculated off-equilibrium conformations for organic molecules
MIT License

mol2 files describing molecular topology? #4

Open jchodera opened 6 years ago

jchodera commented 6 years ago

Thanks for making this fantastic resource available!

Is there a way you could make some description of the molecular topology (e.g. mol2 files) available? While the QM energies are clearly only dependent on the atomic positions, your initial RDKit representation likely contains a mapping from a molecular topology (which includes bonds and bond orders) that allows atom indices to be uniquely identified within the molecular topology. It would be great if this topology information could be provided as well---perhaps as a compressed multimolecule mol2 file?

isayev commented 6 years ago

@jchodera John, sorry for a sluggish response. We have molecular topologies as SMILES strings. Let me find them for you.

jchodera commented 6 years ago

The SMILES strings are still not quite enough to uniquely identify which atom indices go with which atoms in the molecular topology. Did you at least use a deterministic piece of code to go from SMILES -> unique atom ordering?
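To make the ambiguity concrete, here is a minimal sketch in plain Python. The element ordering and both bond tables are made up for illustration and are not taken from the dataset. It shows two bond tables that are both valid for ethanol (SMILES `CCO`) with the same element sequence, yet assign the oxygen to different carbon indices.

```python
# Element sequence as it might accompany a coordinate array (hypothetical ordering).
species = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]

# Expected number of bonds per element when every bond is a single bond.
EXPECTED_DEGREE = {"C": 4, "O": 2, "H": 1}


def is_consistent(species, bonds):
    """Return True if every atom has the number of single bonds its element expects."""
    degree = [0] * len(species)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return all(d == EXPECTED_DEGREE[s] for d, s in zip(degree, species))


# Bond table A: the oxygen (index 2) is attached to carbon 1.
bonds_a = {(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)}
# Bond table B: the oxygen (index 2) is attached to carbon 0.
bonds_b = {(0, 1), (0, 2), (1, 3), (1, 4), (1, 5), (0, 6), (0, 7), (2, 8)}

print(is_consistent(species, bonds_a))  # True
print(is_consistent(species, bonds_b))  # True
print(bonds_a == bonds_b)               # False
```

Both tables describe ethanol, so the SMILES string plus the element sequence cannot tell them apart; only an explicit atom-indexed bond table (or the exact code that produced the ordering) can.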

jchodera commented 6 years ago

We actually had a timely discussion with @dgasmith this weekend about how we might better facilitate interoperability between quantum chemistry and molecular mechanics topology representations, especially in light of the new JSON schema being developed for quantum chemistry.

isayev commented 6 years ago

Ugh... true. We would need to ask in-house Jedi master @Jussmith01 for that.

isayev commented 6 years ago

JSON is nice, I will have a look. Hopefully it is not like in the XKCD comic.

jchodera commented 6 years ago

> Ugh... true. We would need to ask in-house Jedi master @Jussmith01 for that.

SMILES and a short piece of code to reproducibly generate the molecular topology would be sufficient, but it would be much more robust to just have a big multi-molecule mol2 or SDF tarball that has the same database keys since this would guard against changes to upstream codes (like RDKit) that change atom ordering.
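A minimal sketch of what one record in such a keyed multi-molecule SDF could look like, written in plain Python against the V2000 molfile layout. The function name `sdf_record`, the tag name `DATABASE_KEY`, the key value, and the approximate HCN geometry are all hypothetical and not taken from the ANI-1 release. In practice a toolkit writer such as RDKit's `SDWriter` would be the more likely choice.

```python
def sdf_record(name, species, coords, bonds, key, tag="DATABASE_KEY"):
    """Format one V2000 SDF record.

    bonds is a list of (i, j, order) with 0-based atom indices;
    key is stored as a data item so the record can be matched to the database.
    """
    lines = [name, "", ""]  # three header lines: title, program line, comment
    lines.append(f"{len(species):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for sym, (x, y, z) in zip(species, coords):
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {sym:<3}" + " 0" + "  0" * 11)
    for i, j, order in bonds:
        # The molfile format uses 1-based atom indices.
        lines.append(f"{i + 1:3d}{j + 1:3d}{order:3d}  0")
    lines.append("M  END")
    lines += [f">  <{tag}>", key, "", "$$$$"]
    return "\n".join(lines) + "\n"


# Hypothetical example: hydrogen cyanide with an approximate linear geometry (angstroms).
record = sdf_record(
    name="hydrogen cyanide",
    species=["H", "C", "N"],
    coords=[(0.0, 0.0, -1.064), (0.0, 0.0, 0.0), (0.0, 0.0, 1.156)],
    bonds=[(0, 1, 1), (1, 2, 3)],
    key="example_key_0001",
)
print(record)
```

Because the atom block is written in the same order as the coordinate arrays, the bond table fixes which index is which atom, independent of any later change in how a toolkit orders atoms when parsing SMILES.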

roitberg commented 6 years ago

Hi John, we can certainly go xyz --> mol2, but I am not sure the bond orders, etc., will be there. I am also slightly worried about the following. Take molecule i, for which we have N ‘conformations’. Since we are doing some pretty serious normal-mode displacements for sampling, one can imagine conformations having different bond orders according to whatever algorithm one uses to create the mol2 file. This is either good news (since it is possible that stretching a bond can give you a change in bond order) or bad news (if somehow you will use this data assuming the same bond orders for all conformers).
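The concern can be illustrated with a minimal sketch, assuming a simple distance-based perception rule: two atoms are bonded when their separation is below 1.3 times the sum of their covalent radii. The 1.3 scale factor, the function name `perceive_bonds`, and the HCN geometries are illustrative assumptions, not the algorithm of any particular xyz-to-mol2 converter.

```python
import math

# Single-bond covalent radii in angstroms (Cordero et al., 2008).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def perceive_bonds(species, coords, scale=1.3):
    """Return the set of (i, j) pairs closer than scale * (r_i + r_j)."""
    bonds = set()
    for i in range(len(species)):
        for j in range(i + 1, len(species)):
            cutoff = scale * (COVALENT_RADII[species[i]] + COVALENT_RADII[species[j]])
            if math.dist(coords[i], coords[j]) < cutoff:
                bonds.add((i, j))
    return bonds


species = ["H", "C", "N"]
# Approximate equilibrium geometry of HCN (linear, along z).
equilibrium = [(0.0, 0.0, -1.064), (0.0, 0.0, 0.0), (0.0, 0.0, 1.156)]
# Hypothetical displaced conformer with the C-H bond stretched to 1.6 angstroms.
stretched = [(0.0, 0.0, -1.6), (0.0, 0.0, 0.0), (0.0, 0.0, 1.156)]

print(perceive_bonds(species, equilibrium))  # {(0, 1), (1, 2)}
print(perceive_bonds(species, stretched))    # {(1, 2)}
```

With this rule the stretched conformer loses its C-H bond, so two conformers of the same molecule would end up with different perceived topologies.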

jchodera commented 6 years ago

> We can certainly go xyz --> mol2, but I am not sure the bond orders, etc will be there.

In the RDKit stage of your processing, these molecules must have a well-defined set of bond orders and topology---otherwise, RDKit would not have been able to process them. That representation should be sufficient to write out as mol2 or SDF format.

You are certainly correct that the subsequent perturbations might distort the bond orders or even perceived chemical connectivity! It may be possible for us to effectively deal with this through the computation of bond orders (e.g. Wiberg bond orders), though I'm not sure we could afford to do the same level of theory to evaluate this that you've done.

Despite the chemical distortion issue, I think it would be super useful if the provenance information for what chemical topology these structures originated from (via mol2/SDF) were available.

P.S. Happy New Year!