Closed HannesStark closed 3 years ago
Thanks Hannes, I'm glad you're enjoying it!
For mapping GEOM to QM9, I'd recommend comparing SMILES strings. You can convert a SMILES string to its canonical form like this:
from rdkit import Chem
canon_smiles = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
Does this help?
I haven't thoroughly compared the GEOM lowest energy conformers to the QM9 structures. My guess is that they'd be pretty similar. There's not a lot of flexibility in the QM9 molecules, so when they were originally optimized they probably didn't get stuck in any flexible local minima. But you'd probably have to check and confirm!
Simon
Thank you very much Simon.
canon_smiles = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
and still contain substrings such as C@@H
that do not appear in any of the smiles strings of the QM9 dataset.Additionally, GEOM contains smiles such as this one Cn1c([NH])cn[nH]c1=O
for which Chem.MolFromSmiles(smiles)
returns None
Sorry for my missing knowledge about SMILES representations but I would be very grateful for any further help!
Hannes
Good point. The @@
indicates chirality. I believe that each geometry in QM9 has two sets of SMILES, one which was used to seed the geometry and perform optimization, and one that was inferred from the geometry using OpenBabel. For example, I just took a look at dsgdb9nsd_007952.xyz
in QM9, which has the SMILES strings CC1CC(=O)CC1C
and C[C@H]1CC(=O)C[C@H]1C
, and the corresponding inchis 1S/C7H12O/c1-5-3-7(8)4-6(5)2/h5-6H,3-4H2,1-2H3
and 1S/C7H12O/c1-5-3-7(8)4-6(5)2/h5-6H,3-4H2,1-2H3/t5-,6+
. I think that if you use this second SMILES for each species, then you should be able to get a match.
If some don't match then you might also want to try the inchi keys. For example, for the QM9 file I mentioned above, you could do:
from rdkit.Chem.inchi import InchiToInchiKey, MolToInchiKey
from rdkit.Chem import MolFromSmiles
# from QM9 file
qm9_inchi = '1S/C7H12O/c1-5-3-7(8)4-6(5)2/h5-6H,3-4H2,1-2H3'
qm9_inchikey = InchiToInchiKey('InChI=' + qm9_inchi)
# from GEOM
geom_smiles = 'CC1CC(=O)CC1C'
geom_mol = MolFromSmiles(geom_smiles)
geom_inchikey = MolToInchiKey(geom_mol)
As for the SMILES that returns None - the example you gave seems to be an invalid SMILES, so that's concerning. Do you know how many of those there are? I will definitely look into those.
Simon
Just to clarify the chirality point - I think that none of the QM9 SMILES originally had chirality specified. But if you have a chiral center, then when you create a geometry from a SMILES, it will have a specified chirality, probably chosen randomly. So when you convert that geometry back into a SMILES, it will have that chirality specified through @@
.
For these 727 smiles strings, I get None out of all 133k SMILES like this:
import msgpack
from rdkit.Chem import MolFromSmiles
unpacker = msgpack.Unpacker(open('dataset/GEOM/qm9_crude.msgpack', "rb"))
none_smiles = []
for pack in tqdm(unpacker):
for i, smiles in enumerate(pack.keys()):
mol = MolFromSmiles(smiles)
if mol == None:
none_smiles.append(smiles)
print(none_smiles)
print(len(none_smiles))
Should I maybe not be using the smiles given as the keys of the dictionary?
Hmm that's unfortunate. I think SMILES as keys is the way to go. Does the same problem happen when you use the RDKit pickles instead? If it's still a problem, then for now I would just skip over those geometries.
Any luck with the SMILES that aren't none?
I have not tried using the pickle files yet as I was hesitant to download the 49GB file that also includes the GEOM Drugs molecules. I am trying using those right now.
For the other SMILES, everything is working out. Thank you very much for your help!
On another note, the paper states that there are 304,466 molecules in the drugs dataset.
I can, however, only find 292,035 molecules in the drugs_crude.msgpack
file.
Is the rest only provided in the RDKit pickles, or am I missing something else?
And secondly, when working with the pickle files now, I found that for 173 SMILES (such as 'C=CCn1c(SCC(=O)N2CCOCC2)nc2scc(-c3ccco3)c2c1=O') there is no ensembleenergy as key in the dictionary. For which ones is that the case? Thanks for any additional information.
Thanks for letting me know that the size of the pickles is an issue. Would it be helpful for me to post the datasets separately so they don't have to be downloaded together?
As for the number of molecules, we haven't been updating the messagepack files but have been updating the pickles. Since we first posted the dataset we added 12k species to the pickles. This is in the README but evidently it was easy to miss -any suggestions on where to put this so that everybody sees it?
For the 173 smiles, are they only missing ensembleenergy
or all the other conformer-related properties too?
For me, it would have been better in the beginning when I just wanted to use the GEOM QM9 dataset if the downloads were split up.
I do not have any suggestions for where to describe that the messagepack files are not updated, and I think I just should have read the README more carefully.
Only the SARS properties were in these 173 molecules, and the energy and entropy properties were missing.
That's great to know, thank you! I will work on splitting them up. And sounds good about the README - we've all been there.
As for the missing molecules, I think the crest simulations either failed or timed out for these species, so there is no crest data available. When making the dataset I tried to exclude any species that didn't have crest data. But evidently that didn't work every time, so I'll make sure to correct that.
Anything else I can help with?
I have no further questions, and my models are happily training. A thousand thanks for your great help!
That's great to hear! My pleasure :)
Hello,
Thanks for the great dataset! The qm9 dataset already has a single conformer for every molecule, and they provide molecular property labels for the molecules.
I would like to use the additional conformers in GEOM qm9 to predict the properties in the qm9 dataset. What would you recommend as the best way to map between qm9 and GEOM qm9?
Furthermore, do you know what differences I should expect between the single conformation given in qm9 and the lowest energy conformation provided in GEOM qm9?
Thank you very much for any information!