snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Why does ogbg-molhiv have disconnected graphs? #109

Closed ShichangZh closed 3 years ago

ShichangZh commented 3 years ago

May I ask why ogbg-molhiv has disconnected graphs? I thought these graphs represent molecules and should therefore be connected. For example, I plotted the graph with index 3732, and it looks like the following. Why is node 27, which is a single atom, considered part of this molecule?

[Image: Screen Shot 2021-03-02 at 9 18 31 PM]

weihua916 commented 3 years ago

Good question. On very rare occasions, the dataset does contain isolated atoms. Unfortunately, I am not an expert in chemistry, so I do not know the answer, but I am pretty sure the dataset is correct as it is.

weihua916 commented 3 years ago

Hi! I scrutinized the data a bit and found the original molecule is indeed disconnected, as shown below.

[Image]

Below is the IPython notebook script that I used to obtain the figure above.

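```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole  # Needed to show molecules inline in the notebook

# Load the SMILES mapping that ships with the downloaded dataset
molhiv_df = pd.read_csv('ogbg_molhiv/mapping/mol.csv.gz')
smiles = molhiv_df.smiles[3732]
# CCC(O)(C(=O)O)c1cc2n(c(=O)c1CO)Cc1cc3ccccc3nc1-2.[NaH]
mol = Chem.MolFromSmiles(smiles)
mol  # As the last expression in a cell, this renders the molecule
```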
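The disconnection is also visible in the SMILES string itself: in SMILES notation, a `.` separates fragments that are not bonded to each other. A minimal sketch of that check, using only the standard library and the SMILES string copied from the script above, might look like this. It assumes no ring-closure digits are shared across a `.`, which is the usual case but not guaranteed by the notation.

```python
# SMILES string for graph 3732, copied from the script above.
smiles = "CCC(O)(C(=O)O)c1cc2n(c(=O)c1CO)Cc1cc3ccccc3nc1-2.[NaH]"

# A "." in SMILES means "no bond here", so splitting on it gives the
# separate fragments (assuming no ring closure spans the dot).
fragments = smiles.split(".")
is_disconnected = len(fragments) > 1

print(fragments)
print(is_disconnected)
```

Here the second fragment, `[NaH]`, is the single atom that shows up as an isolated node in the graph.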

Hope this helps!

ShichangZh commented 3 years ago

Thank you for the quick response! Yes, it is helpful. I guess I will consult some experts in chemistry to learn more about it.

chupvl commented 1 year ago

Answering for the record: those detached graphs are so-called salts (e.g. NaCl, KCl), which are not needed in most cases. The easiest way to remove them is to keep the largest connected fragment and discard the smaller ones.
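For anyone who wants to try the "keep the largest graph" approach directly on the graph data, here is a rough sketch in plain Python. The function name `largest_component` and the `(num_nodes, edges)` input format are assumptions made for illustration, not part of the OGB API; OGB graph objects store edges as an `edge_index` array, so its columns would first need to be converted into `(u, v)` pairs.

```python
from collections import deque


def largest_component(num_nodes, edges):
    """Return the sorted node ids of the largest connected component.

    `edges` is an iterable of (u, v) pairs, treated as undirected.
    This helper is hypothetical and not part of OGB.
    """
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    seen = [False] * num_nodes
    best = []
    for start in range(num_nodes):
        if seen[start]:
            continue
        # Breadth-first search collects one connected component.
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if not seen[neighbor]:
                    seen[neighbor] = True
                    queue.append(neighbor)
        # Keep the biggest component seen so far (first one wins ties).
        if len(component) > len(best):
            best = component
    return sorted(best)


# Toy example (not the real graph 3732): a 4-atom chain plus one isolated atom.
toy_edges = [(0, 1), (1, 2), (2, 3)]
kept = largest_component(5, toy_edges)
print(kept)
```

The returned node ids can then be used to filter the node features and edges. Note that this drops every smaller fragment, which is only appropriate when those fragments really are counterions or other components you do not need.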