Closed PierreHao closed 2 years ago
Hi! Why did you use 1.xyz? You should not use pcqm4m-v2_xyz.zip
because it is outdated. Did you use pcqm4m-v2-train.sdf?
You should be able to reproduce the 2D graph with:
from rdkit import Chem
suppl = Chem.SDMolSupplier('pcqm4m-v2-train.sdf')
for idx, mol in enumerate(suppl):
    print(f'{idx}-th rdkit mol obj: {mol}')
See here for more details.
@weihua916 Thank you for your reply. I have also tried pcqm4m-v2-train.sdf, but it does not seem to provide H atoms. In addition, the inference result with the 2D graph from pcqm4m-v2-train.sdf is slightly different from the result with the original data.csv.gz (some parts are the same). I am debugging to find the reason now. Have you seen the same problem?
Correct, we do not provide H atoms, for chemistry-related reasons (@nakatamaho can elaborate).
And yes, it is a known issue that some 2D graphs of pcqm4m-v2-train.sdf are different. Here, we wrote "Known issue: A very small number of training molecules (around 46 out of 3,378,606) have 2D graph structures that are inconsistent with the ones calculated from SMILES. These molecules often involve Si atom(s). For the rest of the training molecules, the 2D graphs constructed from SDF and SMILES are identical (even though the atom-to-atom correspondence is not available)."
I had read your description carefully before. Now I find that we get a different GNN inference result for the 2D graph of the first molecule (which does not contain Si). Maybe it is a problem in my code; I'll check again.
Hi PierreHao,
The SDF does contain hydrogen atoms. If you look at the file "pcqm4m-v2-train.sdf", you'll find the following part:
/Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
 OpenBabel02162213453D
 34 34  0  0  0  0  0  0  0  0999 V2000
    7.0068    1.8970    3.2727 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.4650   -0.2257    1.3353 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4432    2.1011    9.5057 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3366   -2.5350    5.1551 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6396    1.6000    5.9257 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9474    1.8190    7.1190 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.1830    0.2123    3.8873 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.3870    0.7439    3.5935 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7754   -0.2729    6.7631 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.3568    1.0842    2.1622 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4537    0.4611    5.1348 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0098    0.8730    7.5380 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4835   -0.4763    5.5827 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0856    0.5626    2.3959 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.9369    0.8984    1.1006 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.2593    0.9658    8.6759 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3108   -1.5731    4.7848 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.1466    1.3059    4.1845 H   0  0  0  0  0  0  0  0  0  0  0  0
    6.4174    2.7835    3.5327 H   0  0  0  0  0  0  0  0  0  0  0  0
    7.9816    2.2170    2.9038 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.3066   -1.2574    1.6696 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.1340   -0.2156    0.4776 H   0  0  0  0  0  0  0  0  0  0  0  0
    3.4967    0.2086    1.0638 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2404    1.9714   10.3469 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.1957    3.0318    8.9780 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.4730    2.1641    9.8816 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.3625   -3.3009    4.3776 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6682   -2.0951    5.1999 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.5723   -2.9938    6.1242 H   0  0  0  0  0  0  0  0  0  0  0  0
    3.3352    2.3628    5.5870 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.1307    2.7235    7.6863 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.7052   -0.4677    3.1921 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.8961    1.3664    4.3173 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0389   -0.9814    7.1205 H   0  0  0  0  0  0  0  0  0  0  0  0
In the last 17 lines, you'll find Hydrogen atoms.
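A possible reason the hydrogens seem to be missing when the file is read with RDKit, as in the snippet earlier in this thread, is that `Chem.SDMolSupplier` strips explicit hydrogens by default (`removeHs=True`). The sketch below is not from the thread. It writes a small stand-in molecule (methanol, chosen only for illustration) to a temporary SDF and reads it back both ways. For the real data, the same `removeHs=False` argument would presumably be passed when opening pcqm4m-v2-train.sdf.

```python
import os
import tempfile

from rdkit import Chem

# Stand-in molecule (methanol) with explicit hydrogens; NOT taken from the dataset.
mol = Chem.AddHs(Chem.MolFromSmiles("CO"))

# Write it to a temporary SDF so that SDMolSupplier can read it back.
sdf_path = os.path.join(tempfile.mkdtemp(), "example.sdf")
writer = Chem.SDWriter(sdf_path)
writer.write(mol)
writer.close()

# Default behaviour: explicit hydrogens are removed while parsing.
heavy_only = Chem.SDMolSupplier(sdf_path)[0]

# removeHs=False keeps every atom record that is present in the file.
with_h = Chem.SDMolSupplier(sdf_path, removeHs=False)[0]

print("atoms with default settings:", heavy_only.GetNumAtoms())
print("atoms with removeHs=False:  ", with_h.GetNumAtoms())
```

If the atom count only matches the counts line of the record when `removeHs=False` is set, the hydrogens are in the file and were dropped at parse time.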
@nakatamaho I have seen it, thanks
Yes, slightly different. My experiments show a performance difference of 0.005 on the pcqm4m-v2 validation set. For the 2nd OGB-LSC, should we not use the H atom information from the xyz file?
Note that you can also extract molecular graphs from the SDF, including bond orders.
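As a rough illustration of what extracting a graph with bond orders could look like, here is a minimal sketch using RDKit. The helper name `mol_to_edges` and the stand-in molecule (acetaldehyde, round-tripped through a mol block to mimic one SDF record) are assumptions made for this example and do not come from the dataset or the OGB code.

```python
from rdkit import Chem


def mol_to_edges(mol):
    """Return a sorted list of (atom_i, atom_j, bond_order) tuples (hypothetical helper)."""
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        # GetBondTypeAsDouble gives 1.0, 2.0, 3.0, or 1.5 for aromatic bonds.
        edges.append((min(i, j), max(i, j), bond.GetBondTypeAsDouble()))
    return sorted(edges)


# Stand-in for one SDF record: acetaldehyde written as a mol block and parsed again.
mol_block = Chem.MolToMolBlock(Chem.MolFromSmiles("CC=O"))
mol = Chem.MolFromMolBlock(mol_block)

atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
edges = mol_to_edges(mol)
print(atoms)
print(edges)
```

With real data, `mol` would instead come from iterating over `Chem.SDMolSupplier('pcqm4m-v2-train.sdf')`.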
"For the 2nd OGB-LSC, we should not use atom H information from xyz file?" It is up to you! In this SDF, we provide all the information for the molecules, including the xyz coordinates of the atoms of each molecule (and the molecules are all neutral).
First, SDF is just a naive collection of mol files. Second, you don't need xyz. SDF contains everything you need! Anyway: I attached a.sdf.txt. Please save it as a.sdf. Then:
$ obabel -isdf a.sdf -o xyz
34
/Volumes/PubChemQCDataBaseWork/pubchemqc2017database/xyz/00000000_00009999/1.xyz
C          7.00680        1.89700        3.27270
C          4.46500       -0.22570        1.33530
C          0.44320        2.10110        9.50570
C          0.33660       -2.53500        5.15510
C          2.63960        1.60000        5.92570
C          1.94740        1.81900        7.11900
C          3.18300        0.21230        3.88730
C          4.38700        0.74390        3.59350
C          0.77540       -0.27290        6.76310
C          6.35680        1.08420        2.16220
C          2.45370        0.46110        5.13480
C          1.00980        0.87300        7.53800
C          1.48350       -0.47630        5.58270
N          5.08560        0.56260        2.39590
O          6.93690        0.89840        1.10060
O          0.25930        0.96580        8.67590
O          1.31080       -1.57310        4.78480
H          7.14660        1.30590        4.18450
H          6.41740        2.78350        3.53270
H          7.98160        2.21700        2.90380
H          4.30660       -1.25740        1.66960
H          5.13400       -0.21560        0.47760
H          3.49670        0.20860        1.06380
H         -0.24040        1.97140       10.34690
H          0.19570        3.03180        8.97800
H          1.47300        2.16410        9.88160
H          0.36250       -3.30090        4.37760
H         -0.66820       -2.09510        5.19990
H          0.57230       -2.99380        6.12420
H          3.33520        2.36280        5.58700
H          2.13070        2.72350        7.68630
H          2.70520       -0.46770        3.19210
H          4.89610        1.36640        4.31730
H          0.03890       -0.98140        7.12050
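To make the point that the mol record already holds everything an xyz file does, here is a minimal pure-Python sketch of the same conversion. It is not how Open Babel implements it. The function name `molblock_to_xyz` and the two-atom example record are made up for illustration, and the parser assumes a well-formed V2000 record whose atom-line fields are separated by whitespace.

```python
def molblock_to_xyz(mol_block):
    """Convert the atom block of a V2000 mol record into XYZ-format text."""
    lines = mol_block.splitlines()
    title = lines[0].strip()       # line 1 of a mol record: title
    counts = lines[3]              # line 4 of a mol record: counts line
    n_atoms = int(counts[0:3])     # first 3-character field: number of atoms
    out = [str(n_atoms), title]
    for line in lines[4:4 + n_atoms]:
        # Assumes x, y, z and the element symbol are whitespace-separated.
        x, y, z, symbol = line.split()[:4]
        out.append(f"{symbol:<3}{float(x):>15.5f}{float(y):>15.5f}{float(z):>15.5f}")
    return "\n".join(out)


# Hypothetical two-atom record; the coordinates are copied from the excerpt in this thread.
EXAMPLE = "\n".join([
    "example (hypothetical two-atom excerpt)",
    "  sketch",
    "",
    "  2  0  0  0  0  0  0  0  0  0999 V2000",
    "    7.0068    1.8970    3.2727 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    7.1466    1.3059    4.1845 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

xyz_text = molblock_to_xyz(EXAMPLE)
print(xyz_text)
```

Splitting on whitespace is a simplification: V2000 is a fixed-width format, so very large coordinates could run together, and a stricter parser would slice by column instead.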
@nakatamaho Thank you so much
Hi, I tried to get the mol from pcqm4m-v2-train.sdf and compare its structure with the mol from rdkit.Chem.MolFromSmiles. For example, with obabel -ixyz 1.xyz -osmi -O 1.smi we get the SMILES CC(=O)N(C)/C=C/c1ccc(cc1OC)OC, but the original is COc1cc(OC)ccc1/C=C/N(C(=O)C)C. I then ran GNN inference with these two graphs, and the final results are slightly different. It seems that the 2D graphs are not the same? By the way, does the SDF not provide the xyz coordinates of H, like pcqm4m-v2_xyz.zip does?
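One way to check whether the two SMILES strings above describe the same 2D graph is to compare their RDKit canonical SMILES. The sketch below does only that. The helper name `same_2d_graph` is made up for this example, and the check says nothing about why a particular GNN pipeline might give different outputs.

```python
from rdkit import Chem


def same_2d_graph(smiles_a, smiles_b):
    """Return True if both SMILES canonicalize to the same string (hypothetical helper)."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        raise ValueError("could not parse one of the SMILES strings")
    return Chem.MolToSmiles(mol_a) == Chem.MolToSmiles(mol_b)


smiles_from_obabel = "CC(=O)N(C)/C=C/c1ccc(cc1OC)OC"
smiles_from_csv = "COc1cc(OC)ccc1/C=C/N(C(=O)C)C"

is_same = same_2d_graph(smiles_from_obabel, smiles_from_csv)

# The atom ordering still follows the order of atoms in each input string.
order_a = [a.GetSymbol() for a in Chem.MolFromSmiles(smiles_from_obabel).GetAtoms()]
order_b = [a.GetSymbol() for a in Chem.MolFromSmiles(smiles_from_csv).GetAtoms()]

print("same canonical SMILES:", is_same)
print("same atom ordering:   ", order_a == order_b)
```

If the canonical forms agree while the atom orderings differ, the graphs are identical up to a relabeling of atoms. Any remaining difference in inference output would then have to come from somewhere else in the pipeline, such as order-dependent features or preprocessing.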