snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

PCQv2 dataset problem #354

Closed PierreHao closed 1 year ago

PierreHao commented 1 year ago

@weihua916 @nakatamaho Hi, when I run the PCQv2 dataset with a GIN, I find that the inference results on graphs built from the SDF molecules differ from those on graphs built from the raw SMILES for about 40% of the molecules. The main difference is the chirality of the atoms. Is there any way to make them consistent?

weihua916 commented 1 year ago

Hi! Thanks for letting us know this. That's an interesting observation.

I used the code below as a quick sanity check. I compared the Morgan fingerprints (which work similarly to GNNs) of the molecules computed from SMILES versus SDF, and calculated the mismatch ratio, i.e., the percentage of molecules whose two fingerprints do not match.

from rdkit import Chem
from rdkit.Chem import AllChem
from tqdm import tqdm
import pandas as pd
import torch
import os

num_trains = 10000 # should be 3378606, subsampled for approximation
useChirality = True # False

SMILES_PATH = 'pcqm4m-v2/raw/data.csv.gz'
SDF_PATH = 'test_download/pcqm4m-v2-train.sdf'

df = pd.read_csv(SMILES_PATH)
smiles_list = df['smiles'].tolist()[:num_trains]
idx_list = df['idx'].tolist()[:num_trains]
suppl = Chem.SDMolSupplier(SDF_PATH)

count = 0

for idx, smiles in tqdm(zip(idx_list, smiles_list), total = len(smiles_list)):
    # from SMILES
    m = Chem.MolFromSmiles(smiles)
    num_atoms_smiles = m.GetNumAtoms()
    fp_smiles = torch.tensor(list(AllChem.GetMorganFingerprintAsBitVect(m,3,useChirality=useChirality)), dtype=torch.int8)
    # from SDF
    mol = next(suppl)
    num_atoms_sdf = mol.GetNumAtoms()
    fp_sdf = torch.tensor(list(AllChem.GetMorganFingerprintAsBitVect(mol,3,useChirality=useChirality)), dtype=torch.int8)

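    # count molecules whose SMILES and SDF fingerprints differ in any bit
    if not torch.all(fp_smiles == fp_sdf):
        print(f'{idx}-th molecule mismatched')
        print()
        count += 1

# percentage of the subsampled molecules with mismatching fingerprints
mismatch_ratio = 100 * float(count) / len(smiles_list)
print(mismatch_ratio)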

When useChirality=False, the mismatch ratio was 0%. When useChirality=True, it rose to 0.61%, which is still far below the 40% that @PierreHao reported. Did I miss anything? Of course, there may be some discrepancy between the molecular graphs we provide and the Morgan fingerprints that I used above.

@nakatamaho Do we know why we have many more mismatches when we include the chirality?

PierreHao commented 1 year ago

@weihua916, for example, I checked the chirality of the SMILES and SDF molecules with indices 0 to 19. The molecule IDs that differ are [0,6,11,12,15,17,18], a ratio of 7/20. My method: pred1 = model(data1), pred2 = model(data2), then sum(torch.abs(pred1-pred2) > 0.0001). With this, I got about 100w (1 million) differing IDs.
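The prediction comparison described above can be sketched in plain Python as follows. This is only an illustration: the function name `find_prediction_mismatches` and the sample values are hypothetical, and in practice the two lists would hold the model outputs for the SMILES-derived and SDF-derived graphs.

```python
def find_prediction_mismatches(preds_a, preds_b, tol=1e-4):
    """Return (indices, ratio) of positions where two prediction lists
    differ by more than `tol` in absolute value."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    mismatched = [
        i for i, (a, b) in enumerate(zip(preds_a, preds_b))
        if abs(a - b) > tol
    ]
    ratio = len(mismatched) / len(preds_a) if preds_a else 0.0
    return mismatched, ratio


# Hypothetical predictions for four molecules (values are made up).
preds_smiles = [5.10, 4.32, 6.00, 3.75]
preds_sdf = [5.10, 4.35, 6.00, 3.75005]

ids, ratio = find_prediction_mismatches(preds_smiles, preds_sdf)
print(ids, ratio)
```

Note that a difference in predictions only shows that the two input graphs differ in some feature the model uses; it does not by itself say which feature is responsible.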

weihua916 commented 1 year ago

Interesting, thanks for letting me know.

I have checked the chirality of smiles and sdf with index 0 to 19, the different molecular id is [0,6,11,12,15,17,18],ratio is 7/20.

Can you paste the code you used to do this?

@nakatamaho I am curious whether the chirality affects the HOMO-LUMO gap in chemistry. Basically, does chirality matter in our prediction task? Also, can SMILES represent chirality?

nakatamaho commented 1 year ago

@PierreHao and @weihua916, sorry for the late reply. A 40% difference is too large, and I think @weihua916's comparison looks correct. Unfortunately, I don't know why there are still differences. I obtained the isomeric SMILES from the XYZ files, and got the MOL files from the XYZ files using Open Babel. Both the isomeric SMILES and the MOL files include chirality centers. Thus, if we look at the XYZ files of the molecules whose stereochemistry differs, we may understand the discrepancy.

@weihua916 I am curious whether the chirality affects the HOMO-LUMO gap in chemistry. Basically, does chirality matter in our prediction task? Also, can SMILES represent chirality?

For the first question: The chirality of a molecule may affect the HOMO-LUMO gap. For example, if two molecules each have a single chirality center, one L and the other R, there is no difference in the HOMO-LUMO gap. This is because the mirror image of L is R, and that of R is L.

It can be different if a molecule has two or more chirality centers. With two chirality centers, there are four possibilities: (L, L), (L, R), (R, L), (R, R). (L, L) and (R, R) should have the same HOMO-LUMO gap, since the mirror image of (L, L) is (R, R); likewise, (L, R) and (R, L) should have the same gap. However, the HOMO-LUMO gaps of (R, R) and (R, L) differ because they are not mirror images of each other.
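The grouping above can be enumerated with a short script. This is a minimal sketch that only models the L/R labels used in this comment; the helper names `mirror` and `mirror_classes` are made up for illustration, and any additional symmetry of the molecule itself is ignored.

```python
from itertools import product

FLIP = {"L": "R", "R": "L"}


def mirror(config):
    """Mirror image of a configuration: every center is flipped."""
    return tuple(FLIP[c] for c in config)


def mirror_classes(num_centers):
    """Group all L/R configurations into sets of mirror images.
    Configurations in the same group should share a HOMO-LUMO gap."""
    classes, seen = [], set()
    for config in product("LR", repeat=num_centers):
        if config in seen:
            continue
        pair = {config, mirror(config)}
        seen |= pair
        classes.append(sorted(pair))
    return classes


for group in mirror_classes(2):
    print(group)
# [('L', 'L'), ('R', 'R')]
# [('L', 'R'), ('R', 'L')]
```

With one center there is a single group containing both L and R, which matches the statement that a lone chirality center does not change the gap.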

For the second question: Basically, does chirality matter in our prediction task? Right. It matters.

For the third question: Also, can SMILES represent chirality? Yes, I use Isomeric SMILES which can represent chirality.
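As a rough illustration of how isomeric SMILES carries chirality: tetrahedral centers are written inside bracket atoms with `@` or `@@`. The sketch below only inspects the strings with a regular expression and does not parse the molecule; the function name `chirality_tokens` is hypothetical, and the two example strings are the two enantiomers of alanine.

```python
import re


def chirality_tokens(smiles):
    """Return the '@' / '@@' markers found inside bracket atoms,
    in the order they appear in the SMILES string."""
    tokens = []
    for atom in re.findall(r"\[[^\]]*\]", smiles):
        tokens.extend(re.findall(r"@@?", atom))
    return tokens


print(chirality_tokens("C[C@H](N)C(=O)O"))   # ['@']
print(chirality_tokens("C[C@@H](N)C(=O)O"))  # ['@@']
print(chirality_tokens("CCO"))               # []
```

A string-level check like this only shows whether chirality markers are present. Comparing stereochemistry between a SMILES string and an SDF record would require a cheminformatics toolkit, since the meaning of `@` versus `@@` depends on the order in which neighboring atoms are written.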

PierreHao commented 1 year ago

Thank you for your answers, @nakatamaho @weihua916. My code is too long to paste. My comparison may be too rough: it only compares the inference predictions and does not take into account that (L) = (R), (L,L) = (R,R), and (L,R) = (R,L) with respect to the HOMO-LUMO gap.
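A comparison that accounts for these equivalences could look like the sketch below. It assumes each molecule's chirality has already been reduced to a tuple of L/R labels in a consistent atom order, which is a simplification; the function names and the sample data are hypothetical.

```python
FLIP = {"L": "R", "R": "L"}


def same_up_to_mirror(labels_a, labels_b):
    """True if the two labelings are identical or exact mirror images."""
    a, b = tuple(labels_a), tuple(labels_b)
    mirrored_b = tuple(FLIP[x] for x in b)
    return a == b or a == mirrored_b


def mismatch_ratio(pairs, mirror_aware=True):
    """Fraction of (labels_a, labels_b) pairs counted as mismatched."""
    if not pairs:
        return 0.0
    if mirror_aware:
        bad = sum(not same_up_to_mirror(a, b) for a, b in pairs)
    else:
        bad = sum(tuple(a) != tuple(b) for a, b in pairs)
    return bad / len(pairs)


# Made-up labelings for four molecules: (from SMILES, from SDF).
pairs = [
    (("L",), ("R",)),          # mirror images
    (("L", "R"), ("R", "L")),  # mirror images
    (("L", "L"), ("L", "R")),  # genuinely different
    ((), ()),                  # no chirality centers
]
print(mismatch_ratio(pairs, mirror_aware=False))  # 0.75
print(mismatch_ratio(pairs, mirror_aware=True))   # 0.25
```

In this toy example the strict comparison flags three of four molecules, while the mirror-aware comparison flags only one.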