snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License
1.89k stars 397 forks source link

Match enum with RDKit in the SMILES converter #472

Open Kh4L opened 5 months ago

Kh4L commented 5 months ago

Change enums to match RDKit Atom enums

HybridizationType

https://www.rdkit.org/docs/cppapi/classRDKit_1_1Atom.html#a58e40e30db6b42826243163175cac976

BondType

https://www.rdkit.org/docs/cppapi/classRDKit_1_1Bond.html#a2c93af0aeb3297ee77b6afdc27b68d6f

Kh4L commented 5 months ago

Hi! Thanks for your question. Basic question: Why is it necessary and how does this affect the model performance?

This change allow us to represent all of the possible hybridization and edgebond types.

The goal is to match the smile converter implem in PyG https://github.com/pyg-team/pytorch_geometric/blob/51c57da51d45c19497d393cc678c215eb341b3b4/torch_geometric/utils/smiles.py#L7

Also for backward compatibility, I'd like to use these additional features given some flag argument.

What kind of flag are you thinking of?

weihua916 commented 5 months ago

Got it --- matching with PyG and making the graph expressive is good, but we also don't want to disturb others who have built models on the current version, right?

Why can't people just use PyG's smiles2graph and apply it to PCQMv2 data? You can easily pass your favorite smiles2graph here.

Kh4L commented 4 months ago

Well, PyG's from_smile returns a Data graph and not a dict of np arrays, so this would not work as it is

What we could do is to check the type of graph here and convert it to a dict of arrays