learningmatter-mit / geom

GEOM: Energy-annotated molecular conformations
195 stars 24 forks source link

Map QM9 GEOM to QM9 #5

Closed HannesStark closed 3 years ago

HannesStark commented 3 years ago

Hello,

Thanks for the great dataset! The qm9 dataset already has a single conformer for every molecule, and they provide molecular property labels for the molecules.

I would like to use the additional conformers in GEOM qm9 to predict the properties in the qm9 dataset. What would you recommend as the best way to map between qm9 and GEOM qm9?

Furthermore, do you know what differences I should expect between the single conformation given in qm9 and the lowest energy conformation provided in GEOM qm9?

Thank you very much for any information!

simonaxelrod commented 3 years ago

Thanks Hannes, I'm glad you're enjoying it!

  1. For mapping GEOM to QM9, I'd recommend comparing SMILES strings. You can convert a SMILES string to its canonical form like this:

    from rdkit import Chem
    canon_smiles = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

    Does this help?

  2. I haven't thoroughly compared the GEOM lowest energy conformers to the QM9 structures. My guess is that they'd be pretty similar. There's not a lot of flexibility in the QM9 molecules, so when they were originally optimized they probably didn't get stuck in any flexible local minima. But you'd probably have to check and confirm!

Simon

HannesStark commented 3 years ago

Thank you very much Simon.

  1. The smiles strings provided in the keys of the GEOM dataset do not seem to change after doing canon_smiles = Chem.MolToSmiles(Chem.MolFromSmiles(smiles)) and still contain substrings such as C@@H that do not appear in any of the smiles strings of the QM9 dataset.

Additionally, GEOM contains smiles such as this one Cn1c([NH])cn[nH]c1=O for which Chem.MolFromSmiles(smiles) returns None

Sorry for my missing knowledge about SMILES representations but I would be very grateful for any further help!

Hannes

simonaxelrod commented 3 years ago

Good point. The @@ indicates chirality. I believe that each geometry in QM9 has two sets of SMILES, one which was used to seed the geometry and perform optimization, and one that was inferred from the geometry using OpenBabel. For example, I just took a look at dsgdb9nsd_007952.xyz in QM9, which has the SMILES strings CC1CC(=O)CC1C and C[C@H]1CC(=O)C[C@H]1C, and the corresponding inchis 1S/C7H12O/c1-5-3-7(8)4-6(5)2/h5-6H,3-4H2,1-2H3 and 1S/C7H12O/c1-5-3-7(8)4-6(5)2/h5-6H,3-4H2,1-2H3/t5-,6+. I think that if you use this second SMILES for each species, then you should be able to get a match.

If some don't match then you might also want to try the inchi keys. For example, for the QM9 file I mentioned above, you could do:

from rdkit.Chem.inchi import InchiToInchiKey, MolToInchiKey
from rdkit.Chem import MolFromSmiles

# from QM9 file
qm9_inchi = '1S/C7H12O/c1-5-3-7(8)4-6(5)2/h5-6H,3-4H2,1-2H3'
qm9_inchikey = InchiToInchiKey('InChI=' + qm9_inchi)

# from GEOM
geom_smiles = 'CC1CC(=O)CC1C'
geom_mol = MolFromSmiles(geom_smiles)
geom_inchikey = MolToInchiKey(geom_mol)

As for the SMILES that returns None - the example you gave seems to be an invalid SMILES, so that's concerning. Do you know how many of those there are? I will definitely look into those.

Simon

simonaxelrod commented 3 years ago

Just to clarify the chirality point - I think that none of the QM9 SMILES originally had chirality specified. But if you have a chiral center, then when you create a geometry from a SMILES, it will have a specified chirality, probably chosen randomly. So when you convert that geometry back into a SMILES, it will have that chirality specified through @@.

HannesStark commented 3 years ago

For these 727 smiles strings, I get None out of all 133k SMILES like this:

import msgpack
from rdkit.Chem import MolFromSmiles

unpacker = msgpack.Unpacker(open('dataset/GEOM/qm9_crude.msgpack', "rb"))
none_smiles = []
for pack in tqdm(unpacker):
    for i, smiles in enumerate(pack.keys()):
        mol = MolFromSmiles(smiles)
        if mol == None:
            none_smiles.append(smiles)

print(none_smiles)
print(len(none_smiles))

Should I maybe not be using the smiles given as the keys of the dictionary?

simonaxelrod commented 3 years ago

Hmm that's unfortunate. I think SMILES as keys is the way to go. Does the same problem happen when you use the RDKit pickles instead? If it's still a problem, then for now I would just skip over those geometries.

Any luck with the SMILES that aren't none?

HannesStark commented 3 years ago

I have not tried using the pickle files yet as I was hesitant to download the 49GB file that also includes the GEOM Drugs molecules. I am trying using those right now.

For the other SMILES, everything is working out. Thank you very much for your help!

HannesStark commented 3 years ago

On another note, the paper states that there are 304,466 molecules in the drugs dataset. I can, however, only find 292,035 molecules in the drugs_crude.msgpack file. Is the rest only provided in the RDKit pickles, or am I missing something else?

And secondly, when working with the pickle files now, I found that for 173 SMILES (such as 'C=CCn1c(SCC(=O)N2CCOCC2)nc2scc(-c3ccco3)c2c1=O') there is no ensembleenergy as key in the dictionary. For which ones is that the case? Thanks for any additional information.

simonaxelrod commented 3 years ago

Thanks for letting me know that the size of the pickles is an issue. Would it be helpful for me to post the datasets separately so they don't have to be downloaded together?

As for the number of molecules, we haven't been updating the messagepack files but have been updating the pickles. Since we first posted the dataset we added 12k species to the pickles. This is in the README but evidently it was easy to miss -any suggestions on where to put this so that everybody sees it?

For the 173 smiles, are they only missing ensembleenergy or all the other conformer-related properties too?

HannesStark commented 3 years ago

For me, it would have been better in the beginning when I just wanted to use the GEOM QM9 dataset if the downloads were split up.

I do not have any suggestions for where to describe that the messagepack files are not updated, and I think I just should have read the README more carefully.

Only the SARS properties were in these 173 molecules, and the energy and entropy properties were missing.

simonaxelrod commented 3 years ago

That's great to know, thank you! I will work on splitting them up. And sounds good about the README - we've all been there.

As for the missing molecules, I think the crest simulations either failed or timed out for these species, so there is no crest data available. When making the dataset I tried to exclude any species that didn't have crest data. But evidently that didn't work every time, so I'll make sure to correct that.

Anything else I can help with?

HannesStark commented 3 years ago

I have no further questions, and my models are happily training. A thousand thanks for your great help!

simonaxelrod commented 3 years ago

That's great to hear! My pleasure :)