snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Relating the xyz files to the molecules in PCQM4Mv2 #297

Closed DomInvivo closed 2 years ago

DomInvivo commented 2 years ago

Hello, I would like to use the .xyz files provided for PCQM4Mv2. However, the ordering of the atoms in the file differs from the ordering in the graph representation returned by RDKit. Below, you can see that the atoms in the file are sorted alphabetically by element symbol. [image]

However, RDKit does not sort atoms this way when generating a molecular graph. In the screenshot below, you can see that the atom ordering is different. [image]

So the question is: how do I map the i-th atom in the molecule object to the j-th atom in the xyz file? I believe an alphabetical sort would not work, since there are usually ~4 unique symbols for ~30 atoms, unless both the initial ordering and the sorting algorithm are identical.
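To make the ambiguity concrete, here is a minimal sketch that counts how many atom-to-atom mappings are consistent with element symbols alone. The function name and the ethanol example are illustrative assumptions and are not taken from the dataset.

```python
from collections import Counter
from math import factorial


def count_symbol_consistent_mappings(symbols):
    """Count the atom mappings that agree on element symbols only.

    Atoms sharing a symbol are interchangeable as far as the symbol is
    concerned, so the count is the product of n! over each element count n.
    """
    total = 1
    for n in Counter(symbols).values():
        total *= factorial(n)
    return total


# Ethanol (C2H6O), a small hand-picked molecule used only for illustration.
ethanol = ["C", "C", "O"] + ["H"] * 6
print(count_symbol_consistent_mappings(ethanol))  # 2! * 1! * 6! = 1440
```

Even for nine atoms there are 1440 candidate mappings, so element symbols alone cannot recover the correspondence.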

nakatamaho commented 2 years ago

Hi DomInvivo, your question looks interesting, but I cannot answer it, since it depends on which algorithm RDKit uses to generate a molecular graph from the original xyz file. Does this URL help you? https://stackoverflow.com/questions/51195392/smiles-from-graph

DomInvivo commented 2 years ago

Hello @nakatamaho. Unfortunately, the URL that you provided is not helpful since the .xyz files do not have an adjacency matrix or the bond types. The .xyz file is unusable without either a consistent ordering of the nodes or some adjacency information. See thread here. Basically, RDKit cannot generate a molecule from an .xyz file that contains only positions.

The only way seems to be to match the interatomic distances against known bond lengths, but that would be complicated and error-prone.
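As a rough illustration of that distance-matching idea, here is a minimal sketch that connects two atoms when their distance is below the sum of their covalent radii plus a tolerance. The radii table, the 0.45 Å tolerance, and the function name are assumptions chosen for illustration. The sketch recovers connectivity only, not bond orders, which is part of why the approach is error-prone.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def guess_bonds(atoms, tolerance=0.45):
    """Guess connectivity from positions alone.

    atoms: list of (symbol, x, y, z) tuples, coordinates in angstroms.
    Returns a list of (i, j) index pairs with i < j.
    The tolerance is an assumed slack value, not a calibrated constant.
    """
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            symbol_i, *pos_i = atoms[i]
            symbol_j, *pos_j = atoms[j]
            cutoff = COVALENT_RADII[symbol_i] + COVALENT_RADII[symbol_j] + tolerance
            if math.dist(pos_i, pos_j) < cutoff:
                bonds.append((i, j))
    return bonds


# Hand-written water geometry, used only as a sanity check.
water = [
    ("O", 0.0, 0.0, 0.0),
    ("H", 0.9572, 0.0, 0.0),
    ("H", -0.2400, 0.9266, 0.0),
]
print(guess_bonds(water))  # [(0, 1), (0, 2)]
```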

Are you aware of any group or project that has used the provided .xyz files? From my point of view, the coordinates should be regenerated so that they either include bond information or are provided as an .sdf file instead (SDF explained here).

nakatamaho commented 2 years ago

> Unfortunately, the URL that you provided is not helpful since the .xyz files do not have an adjacency matrix or the bond types.

You're right. In principle, you must recalculate the adjacency matrix and bond types from the internuclear distances. Bonds and bond orders are not rigorous concepts: they cannot be derived from the Schrödinger equation of atoms and molecules or from first-principles calculations. Thus, there is some ambiguity whenever we use bond information and bond orders.

In practice, you can generate .sdf files from .xyz files using Open Babel. You will see bond information and bond orders in the converted SDFs. We calculated canonical SMILES from the xyz files using Open Babel and presented them in the CSV file. I believe RDKit can also calculate SMILES or SDF from xyz. The bonds and bond orders might not be 100% the same; nevertheless, this would (and should) not harm your results.

DomInvivo commented 2 years ago

If possible, could you provide me with the code that you used? This would ensure that we both follow the same protocol and yield the same SDF files. Otherwise, it's not a major problem, I will figure it out. Thank you for the above information!

nakatamaho commented 2 years ago

Hi DomInvivo,

Yes, I also believe reproducibility is very important. Note that in the SDF format the coordinates of each atom are represented with five significant digits, whereas in the XYZ format they are represented with six significant digits. However, the XYZ format is not strictly adhered to: in some XYZ files, the coordinates are represented with 16 significant digits!

You may want to use the numdiff utility (https://www.nongnu.org/numdiff/) to compare the numerical values in two files.
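If you prefer to do the same kind of tolerance-based comparison in a script, here is a minimal Python sketch. It is not a substitute for numdiff. The function name, the 1e-4 absolute tolerance, and the sample XYZ values are assumptions for illustration.

```python
import math


def coords_match(coords_a, coords_b, abs_tol=1e-4):
    """Return True if two coordinate lists agree atom by atom within abs_tol.

    coords_a, coords_b: lists of (x, y, z) tuples in the same atom order.
    abs_tol is an assumed tolerance meant to absorb rounding differences
    between file formats; adjust it to the precision of your files.
    """
    if len(coords_a) != len(coords_b):
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )


# Hypothetical higher-precision XYZ values next to rounded SDF values.
xyz_coords = [(-0.82154321, 4.64001234, -1.37498765)]
sdf_coords = [(-0.8215, 4.6400, -1.3750)]
print(coords_match(xyz_coords, sdf_coords))  # True
```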

The following command converts the XYZ file to SDF:

% obabel -ixyz xtbopt.xyz -osdf
energy: -125.620136609389 gnorm: 0.146629678859 xtb: 6.4.0 (unknown)
 OpenBabel02092220223D

 60 66  0  0  1  0  0  0  0  0999 V2000
   -0.8215    4.6400   -1.3750 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.3887    2.6588   -0.1823 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0389   -5.1231    2.2356 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8288   -4.0304    1.0585 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.3712   -1.5066    1.2964 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2108   -5.2594    3.2012 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.3025    3.0582   -1.4346 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.5911   -0.2415    1.4573 C   0  0  0  0  0  2  0  0  0  0  0  0
    0.6382    3.0438   -4.5356 C   0  0  0  0  0  2  0  0  0  0  0  0
    5.8569    1.2162   -0.4982 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.6005   -4.8261   -1.2103 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.6645    0.8880   -5.1544 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8386    1.2694   -4.6475 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.1427   -0.0766   -4.2517 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6876   -6.3205   -2.5049 C   0  0  0  0  0  2  0  0  0  0  0  0
   -2.9166    1.5719   -4.5635 C   0  0  2  0  0  3  0  0  0  0  0  0
    3.5044    1.2891   -4.7229 C   0  0  0  0  0  2  0  0  0  0  0  0
    5.0271   -1.1302    0.1228 C   0  0  0  0  0  0  0  0  0  0  0  0
....
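A converted record like this carries both the coordinates and the bond table, so each atom index maps directly to a position and to its bonds. Here is a minimal sketch of reading the atom and bond blocks, assuming the standard fixed-width V2000 molfile layout. The helper names and the hand-written water record are hypothetical, and the sketch ignores charges, properties blocks, and multi-record SD files. For real work a toolkit such as RDKit or Open Babel is the safer choice.

```python
def parse_v2000(molblock):
    """Parse the atom and bond blocks of a single V2000 molfile record.

    Returns (atoms, bonds) where atoms is a list of (symbol, (x, y, z))
    and bonds is a list of (i, j, order) with 0-based atom indices.
    """
    lines = molblock.splitlines()
    counts = lines[3]  # three header lines come before the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        xyz = tuple(float(line[k:k + 10]) for k in (0, 10, 20))
        atoms.append((line[31:34].strip(), xyz))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # The file uses 1-based atom numbers; convert to 0-based indices.
        bonds.append((int(line[0:3]) - 1, int(line[3:6]) - 1, int(line[6:9])))
    return atoms, bonds


def _atom_line(symbol, x, y, z):
    """Format one fixed-width atom line (helper for the example only)."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}" + " 0" + "  0" * 11


def _bond_line(i, j, order):
    """Format one fixed-width bond line with 1-based atom numbers."""
    return f"{i:3d}{j:3d}{order:3d}  0"


# Hand-written water record, not taken from the dataset.
water_molblock = "\n".join([
    "water (hand-written example)",
    "  hypothetical-program",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    _atom_line("O", 0.0, 0.0, 0.0),
    _atom_line("H", 0.9572, 0.0, 0.0),
    _atom_line("H", -0.2400, 0.9266, 0.0),
    _bond_line(1, 2, 1),
    _bond_line(1, 3, 1),
    "M  END",
])

atoms, bonds = parse_v2000(water_molblock)
print(atoms)
print(bonds)
```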

weihua916 commented 2 years ago

Hi! We have just uploaded the SDF file, from which you can create 2D graphs with XYZ coordinates. Please see here for the tutorial on how to use the SDF file.

DomInvivo commented 2 years ago

Thanks a lot @nakatamaho @weihua916 ! I forgot to message you that I did it on my side too, OpenBabel worked great!