PCQM4Mv2 sdf problem - Githubissues

PierreHao commented 2 years ago

Hi, I tried to get the mol from pcqm4m-v2-train.sdf and compare its structure with the mol from rdkit.Chem.MolFromSmiles. For example, with obabel -ixyz 1.xyz -osmi -O 1.smi we get the SMILES CC(=O)N(C)/C=C/c1ccc(cc1OC)OC, but the original is COc1cc(OC)ccc1/C=C/N(C(=O)C)C. When I then run GNN inference on these two graphs, the final results are slightly different. It seems that the 2D graphs are not the same? By the way, does the SDF not provide xyz coordinates for H atoms like pcqm4m-v2_xyz.zip does?

weihua916 commented 2 years ago

Hi! Why did you use 1.xyz? You should not use pcqm4m-v2_xyz.zip because it is outdated. Did you use pcqm4m-v2-train.sdf?

weihua916 commented 2 years ago

You should be able to reproduce the 2D graph by

from rdkit import Chem

suppl = Chem.SDMolSupplier('pcqm4m-v2-train.sdf')
for idx, mol in enumerate(suppl):
    print(f'{idx}-th rdkit mol obj: {mol}')

See here for more details.

PierreHao commented 2 years ago

@weihua916 Thank you for your reply. I have also tried pcqm4m-v2-train.sdf, but it does not provide H atoms. Also, the inference results with the 2D graphs from pcqm4m-v2-train.sdf are slightly different from those with the original data.csv.gz (some are the same). I am debugging to find the reason now. Have you seen the same problem?

weihua916 commented 2 years ago

Correct, we do not provide atom H for some chemistry reason (@nakatamaho can elaborate).

And yes, it is a known issue that some 2D graphs of pcqm4m-v2-train.sdf are different. Here, we wrote "Known issue: A very small number of training molecules (around 46 out of 3,378,606) have 2D graph structures that are inconsistent with the ones calculated from SMILES. These molecules often involve Si atom(s). For the rest of the training molecules, the 2D graphs constructed from SDF and SMILES are identical (even though the atom-to-atom correspondence is not available)."

PierreHao commented 2 years ago

I had read your description carefully before. Now I find that we get a different GNN inference result on the 2D graph of the first molecule (which does not contain Si). Maybe it's a code problem; I'll check again.

nakatamaho commented 2 years ago

Hi, PierreHao,

The SMILES from rdkit.Chem.MolFromSmiles and the SMILES from Open Babel can be (slightly) different.
We do include hydrogen atoms in the SDF. If you look at the file "pcqm4m-v2-train.sdf", you'll find the following part:

/Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
 OpenBabel02162213453D

 34 34  0  0  0  0  0  0  0  0999 V2000
    7.0068    1.8970    3.2727 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.4650   -0.2257    1.3353 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4432    2.1011    9.5057 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3366   -2.5350    5.1551 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6396    1.6000    5.9257 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9474    1.8190    7.1190 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.1830    0.2123    3.8873 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.3870    0.7439    3.5935 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7754   -0.2729    6.7631 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3568    1.0842    2.1622 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4537    0.4611    5.1348 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0098    0.8730    7.5380 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4835   -0.4763    5.5827 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0856    0.5626    2.3959 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.9369    0.8984    1.1006 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.2593    0.9658    8.6759 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3108   -1.5731    4.7848 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.1466    1.3059    4.1845 H   0  0  0  0  0  0  0  0  0  0  0  0
    6.4174    2.7835    3.5327 H   0  0  0  0  0  0  0  0  0  0  0  0
    7.9816    2.2170    2.9038 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.3066   -1.2574    1.6696 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.1340   -0.2156    0.4776 H   0  0  0  0  0  0  0  0  0  0  0  0
    3.4967    0.2086    1.0638 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2404    1.9714   10.3469 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.1957    3.0318    8.9780 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.4730    2.1641    9.8816 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.3625   -3.3009    4.3776 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6682   -2.0951    5.1999 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.5723   -2.9938    6.1242 H   0  0  0  0  0  0  0  0  0  0  0  0
    3.3352    2.3628    5.5870 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.1307    2.7235    7.6863 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.7052   -0.4677    3.1921 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.8961    1.3664    4.3173 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0389   -0.9814    7.1205 H   0  0  0  0  0  0  0  0  0  0  0  0

In the last 17 lines, you'll find Hydrogen atoms.
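
A possible reason the hydrogens look missing when the file is read with RDKit: its mol-block readers strip hydrogens by default (`removeHs=True`). The sketch below illustrates this on a small methanol block generated in memory, so it does not depend on the dataset file; the molecule and variable names are illustrative only.

```python
from rdkit import Chem

# Build a small mol block with explicit hydrogens (methanol, CH3OH),
# standing in for one record of the SDF.
mol_with_h = Chem.AddHs(Chem.MolFromSmiles("CO"))
block = Chem.MolToMolBlock(mol_with_h)

# Default parsing removes hydrogens (removeHs=True).
heavy_only = Chem.MolFromMolBlock(block)

# removeHs=False keeps every atom that is written in the block.
all_atoms = Chem.MolFromMolBlock(block, removeHs=False)

print(heavy_only.GetNumAtoms(), all_atoms.GetNumAtoms())
```

`Chem.SDMolSupplier` accepts the same keyword, so `Chem.SDMolSupplier('pcqm4m-v2-train.sdf', removeHs=False)` should return molecules with their hydrogens and 3D coordinates intact.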

PierreHao commented 2 years ago

@nakatamaho I have seen it, thanks

nakatamaho commented 2 years ago

"it seems that 2D graph is not the same?" Yes, it can differ. We generated the SMILES and the SDF from the xyz files using Open Babel. There is no rigorous or standard algorithm for converting atomic xyz coordinates to SMILES, so the resulting SMILES strings can differ slightly. However, the differences should not be large.
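
Two SMILES strings that look different can still encode the same 2D graph. One way to check, sketched here with RDKit, is to canonicalize both and compare; the strings are the two quoted in the first comment, and the helper name `canonical` is made up for this example.

```python
from rdkit import Chem


def canonical(smiles: str) -> str:
    """Return RDKit's canonical SMILES for the given SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol)


obabel_smiles = "CC(=O)N(C)/C=C/c1ccc(cc1OC)OC"    # from Open Babel
original_smiles = "COc1cc(OC)ccc1/C=C/N(C(=O)C)C"  # from the dataset

# Equal canonical SMILES means the same atoms, bonds, and stereo labels.
same_graph = canonical(obabel_smiles) == canonical(original_smiles)
print(same_graph)
```

Even when this comparison succeeds, the atom ordering of the two RDKit molecules differs, so any model input that depends on atom order can still change.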

PierreHao commented 2 years ago

Yes, slightly different. My experiments show a performance difference of 0.005 on the PCQM4Mv2 validation set. For the 2nd OGB-LSC, should we not use the H atom information from the xyz files?

nakatamaho commented 2 years ago

Note that you can also extract the molecular graphs, including bond orders, from the SDF.
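
As a rough illustration of where the bond orders live, here is a minimal pure-Python sketch that reads the bond block of a V2000 mol block. In practice RDKit or Open Babel would be the usual tools; `read_bonds` and `atom_line` are hypothetical helpers, and the formaldehyde coordinates are placeholders rather than real geometry.

```python
def read_bonds(mol_block: str):
    """Return (atom_i, atom_j, bond_order) tuples from a V2000 mol block.

    Atom indices are converted to 0-based. Bond order codes follow the
    V2000 convention: 1 single, 2 double, 3 triple, 4 aromatic.
    """
    lines = mol_block.splitlines()
    counts = lines[3]                  # 4th line is the counts line
    n_atoms = int(counts[0:3])         # fixed-width fields, 3 columns each
    n_bonds = int(counts[3:6])
    first_bond = 4 + n_atoms           # bond block follows the atom block
    bonds = []
    for line in lines[first_bond:first_bond + n_bonds]:
        i, j, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        bonds.append((i - 1, j - 1, order))
    return bonds


def atom_line(x, y, z, symbol):
    # Example-only helper: writes one fixed-width V2000 atom line.
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11


# Hand-built formaldehyde (H2C=O) block with placeholder coordinates.
example = "\n".join([
    "formaldehyde",
    " hand-written example",
    "",
    f"{4:3d}{3:3d}" + "  0" * 8 + "999 V2000",
    atom_line(0.0, 0.0, 0.0, "C"),
    atom_line(1.2, 0.0, 0.0, "O"),
    atom_line(-0.5, 0.9, 0.0, "H"),
    atom_line(-0.5, -0.9, 0.0, "H"),
    f"{1:3d}{2:3d}{2:3d}  0",   # C=O double bond
    f"{1:3d}{3:3d}{1:3d}  0",   # C-H single bond
    f"{1:3d}{4:3d}{1:3d}  0",   # C-H single bond
    "M  END",
])

print(read_bonds(example))
```

The fields are sliced by column rather than split on whitespace because adjacent three-column fields run together once atom indices reach three digits.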

nakatamaho commented 2 years ago

"For the 2nd OGB-LSC, we should not use atom H information from xyz file?" It is up to you! In this SDF we provide all the information for the molecules, including the xyz coordinates of the atoms of each molecule (and the molecules are all neutral).

nakatamaho commented 2 years ago

First, an SDF is just a plain collection of mol files. Second, you don't need the xyz files. The SDF contains everything you need! Anyway, I attached a.sdf.txt. Please save it as a.sdf.

a.sdf.txt

Then:

$ obabel -isdf a.sdf -o xyz
34
/Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
C    7.00680    1.89700    3.27270
C    4.46500   -0.22570    1.33530
C    0.44320    2.10110    9.50570
C    0.33660   -2.53500    5.15510
C    2.63960    1.60000    5.92570
C    1.94740    1.81900    7.11900
C    3.18300    0.21230    3.88730
C    4.38700    0.74390    3.59350
C    0.77540   -0.27290    6.76310
C    6.35680    1.08420    2.16220
C    2.45370    0.46110    5.13480
C    1.00980    0.87300    7.53800
C    1.48350   -0.47630    5.58270
N    5.08560    0.56260    2.39590
O    6.93690    0.89840    1.10060
O    0.25930    0.96580    8.67590
O    1.31080   -1.57310    4.78480
H    7.14660    1.30590    4.18450
H    6.41740    2.78350    3.53270
H    7.98160    2.21700    2.90380
H    4.30660   -1.25740    1.66960
H    5.13400   -0.21560    0.47760
H    3.49670    0.20860    1.06380
H   -0.24040    1.97140   10.34690
H    0.19570    3.03180    8.97800
H    1.47300    2.16410    9.88160
H    0.36250   -3.30090    4.37760
H   -0.66820   -2.09510    5.19990
H    0.57230   -2.99380    6.12420
H    3.33520    2.36280    5.58700
H    2.13070    2.72350    7.68630
H    2.70520   -0.46770    3.19210
H    4.89610    1.36640    4.31730
H    0.03890   -0.98140    7.12050
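
For readers without Open Babel at hand, the same atom data can be pulled out of a mol block with a few lines of Python. This is only a sketch of the idea, not what obabel does internally; `mol_block_to_xyz` and `atom_line` are hypothetical names, and the example block is a truncated two-atom excerpt (the first C and the first H of the molecule above).

```python
def mol_block_to_xyz(mol_block: str) -> str:
    """Convert the atom block of a V2000 mol block to xyz-format text."""
    lines = mol_block.splitlines()
    n_atoms = int(lines[3][0:3])        # counts line: first 3 columns
    out = [str(n_atoms), lines[0]]      # xyz header: atom count, comment line
    for line in lines[4:4 + n_atoms]:
        # Fixed-width atom line: x, y, z in 10 columns each, then the symbol.
        x, y, z = (float(line[k:k + 10]) for k in (0, 10, 20))
        symbol = line[31:34].strip()
        out.append(f"{symbol:<2}{x:11.5f}{y:11.5f}{z:11.5f}")
    return "\n".join(out)


def atom_line(x, y, z, symbol):
    # Example-only helper: writes one fixed-width V2000 atom line.
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11


# Truncated two-atom block with no bonds, for illustration only.
example = "\n".join([
    "1.xyz",
    " hand-written example",
    "",
    f"{2:3d}{0:3d}" + "  0" * 8 + "999 V2000",
    atom_line(7.0068, 1.8970, 3.2727, "C"),
    atom_line(7.1466, 1.3059, 4.1845, "H"),
    "M  END",
])

xyz_text = mol_block_to_xyz(example)
print(xyz_text)
```

In a multi-molecule SDF, records are separated by lines containing `$$$$`, so the file would first have to be split on those before each block is passed to a function like this.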

PierreHao commented 2 years ago

@nakatamaho Thank you so much

snap-stanford / ogb

PCQM4Mv2 sdf problem #336