snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

One-hot encoding for GIN #317

Closed j-adamczyk closed 2 years ago

j-adamczyk commented 2 years ago

GIN assumes that features are one-hot encoded, since summation is injective: "In the first iteration, we do not need MLPs before summation if input features are one-hot encodings as their summation alone is injective".

However, the features in the graph classification example with GIN are not one-hot encoded, even though at least some of them (e.g., atom type) easily could be. Is this intentional, and does GIN not show any degraded performance? For the same reason, shouldn't AtomEncoder actually be harmful to GIN performance due to the loss of injectivity?

weihua916 commented 2 years ago

It is one-hot encoded. AtomEncoder uses the torch.nn.Embedding module, whose integer-index input (e.g., atom type) is equivalent to a one-hot encoding.
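As a minimal illustration of this equivalence (the sizes below are made up for the example, not taken from OGB): passing an integer index to torch.nn.Embedding computes the same thing as multiplying a one-hot vector by the embedding weight matrix.

```python
import torch

# Hypothetical sizes: 119 possible atom types, 100-dimensional embeddings.
num_atom_types, emb_dim = 119, 100
emb = torch.nn.Embedding(num_atom_types, emb_dim)

atom_type = torch.tensor([5])  # integer index, e.g. the atom type
one_hot = torch.nn.functional.one_hot(atom_type, num_atom_types).float()

# The embedding lookup is the same linear map as one-hot times the weight matrix.
assert torch.allclose(emb(atom_type), one_hot @ emb.weight)
```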

j-adamczyk commented 2 years ago

How can they be one-hot encoded when we have 9 features? "Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring or not". I see that the first feature is the atomic number, which is an integer, not a one-hot encoded value. So the model cannot use the raw matrix x, since it's not one-hot encoded, and AtomEncoder outputs dense vectors of a given dimensionality. With TUDatasets, on the other hand, the x matrix would indeed be a huge sparse matrix with one-hot encoded atom types. For OGB, would I have to do such encoding myself, using the first column of the 9 features?

weihua916 commented 2 years ago

Each of the 9 features is an integer index, and each element is embedded independently: https://github.com/snap-stanford/ogb/blob/master/ogb/graphproppred/mol_encoder.py#L22
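A simplified sketch of the pattern in the linked mol_encoder.py: one embedding table per feature column, with the per-column embeddings summed. The vocabulary sizes below are placeholders (I believe the real values come from ogb.utils.features.get_atom_feature_dims()), and the class name is hypothetical.

```python
import torch

# Placeholder vocabulary sizes for the 9 atom features (illustrative only).
full_atom_feature_dims = [119, 4, 12, 12, 10, 6, 6, 2, 2]

class AtomEncoderSketch(torch.nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        # One embedding table per feature column.
        self.atom_embedding_list = torch.nn.ModuleList(
            [torch.nn.Embedding(dim, emb_dim) for dim in full_atom_feature_dims]
        )

    def forward(self, x):
        # x has shape [num_nodes, 9]; each column holds integer indices.
        # Embed each column independently and sum the results.
        out = 0
        for i in range(x.shape[1]):
            out = out + self.atom_embedding_list[i](x[:, i])
        return out
```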

alexsoleg commented 1 year ago

Why use an embedding for each feature of the atom/bond? Especially for the true/false features, where you train emb_dim parameters for only two classes. What advantage does this have over one-hot encoding everything and passing it through an MLP to match emb_dim?
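For what it's worth, the two formulations are closely related: summing per-feature embeddings computes the same function as concatenating one-hot encodings of all features and applying a single bias-free linear layer to emb_dim outputs. A rough sketch with hypothetical sizes (not OGB code):

```python
import torch

emb_dim = 100
feature_dims = [119, 4, 2]  # hypothetical: atom type, chirality, a binary flag

# Per-feature embeddings, AtomEncoder-style: sum of lookups.
embs = torch.nn.ModuleList([torch.nn.Embedding(d, emb_dim) for d in feature_dims])

# One-hot everything, then a single bias-free linear layer to emb_dim.
linear = torch.nn.Linear(sum(feature_dims), emb_dim, bias=False)
# Tie the weights so both compute the same function: the linear layer's weight
# is just the concatenation of the embedding tables.
with torch.no_grad():
    linear.weight.copy_(torch.cat([e.weight for e in embs], dim=0).t())

x = torch.tensor([[5, 2, 1]])  # one node, one integer index per feature
sum_of_embs = sum(e(x[:, i]) for i, e in enumerate(embs))
one_hot = torch.cat(
    [torch.nn.functional.one_hot(x[:, i], d).float() for i, d in enumerate(feature_dims)],
    dim=1,
)
assert torch.allclose(sum_of_embs, linear(one_hot))
# Either way, a binary feature contributes 2 * emb_dim parameters.
```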