It is effectively one-hot encoded. AtomEncoder uses the torch.nn.Embedding
module, whose integer-index input is equivalent to a one-hot encoding (e.g., of the atom type).
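To illustrate the point, here is a toy snippet (not OGB code; the dimensions are assumed) showing that `torch.nn.Embedding` takes integer category indices, which is the same as multiplying a one-hot vector by the embedding matrix:

```python
import torch
import torch.nn as nn

# Toy example: 119 assumed atom-type categories, 8-dim embeddings.
emb = nn.Embedding(num_embeddings=119, embedding_dim=8)
atom_type = torch.tensor([6, 7, 8])            # integer indices, e.g. three atom types

dense = emb(atom_type)                         # shape [3, 8], looked up by index
one_hot = torch.nn.functional.one_hot(atom_type, num_classes=119).float()
assert torch.allclose(dense, one_hot @ emb.weight)  # identical to an explicit one-hot lookup
```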
How can they be one-hot encoded when we have 9 features? "Input node features are 9-dimensional, containing atomic number and chirality, as well as other additional atom features such as formal charge and whether the atom is in the ring or not". I see that the first feature is the atomic number, which is an integer, not a one-hot encoded value. So the model cannot use the raw matrix x, since it's not one-hot encoded, and AtomEncoder outputs dense vectors of a given dimensionality. With TUDatasets, on the other hand, the x matrix would indeed be a huge sparse matrix with one-hot encoded atom types. For OGB, would I have to do such an encoding myself, using the first column of the 9 features?
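For reference, a minimal sketch (assuming the ogbg-molhiv dataset and the PyG loader) of what the raw x matrix actually contains:

```python
from ogb.graphproppred import PygGraphPropPredDataset

# Assuming ogbg-molhiv; the other ogbg-mol* datasets use the same 9-column node features.
dataset = PygGraphPropPredDataset(name="ogbg-molhiv")
data = dataset[0]

print(data.x.dtype, data.x.shape)  # torch.int64, [num_nodes, 9] -- integer indices, not one-hot
print(data.x[0])                   # first atom: 9 categorical indices, atomic-number index first
```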
Each of the 9 features is an index. Each element is embedded independently: https://github.com/snap-stanford/ogb/blob/master/ogb/graphproppred/mol_encoder.py#L22
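A condensed sketch of what the linked AtomEncoder does: one nn.Embedding per feature column, with the per-column embeddings summed. The feature dimensionalities below are placeholders; the real values come from get_atom_feature_dims() in ogb.utils.features.

```python
import torch.nn as nn

# Placeholder dims, one entry per categorical node-feature column.
full_atom_feature_dims = [119, 4, 12, 12, 10, 6, 6, 2, 2]

class AtomEncoderSketch(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        # one embedding table per categorical feature column
        self.atom_embedding_list = nn.ModuleList(
            nn.Embedding(dim, emb_dim) for dim in full_atom_feature_dims
        )

    def forward(self, x):                     # x: LongTensor of shape [num_nodes, 9]
        out = 0
        for i in range(x.shape[1]):           # embed each column independently, then sum
            out = out + self.atom_embedding_list[i](x[:, i])
        return out                            # dense [num_nodes, emb_dim]
```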
Why use separate embeddings for each feature of the atom/bond? Especially for the true/false features, where you train emb_num parameters for only two classes. What advantage does this have over one-hot encoding everything and using an MLP to map to emb_num dimensions?
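To make the parameter-count comparison in the question concrete, here is a toy check (dimensions assumed): for a single true/false feature, an nn.Embedding with 2 rows and a bias-free nn.Linear on the 2-dimensional one-hot both have 2 * emb_dim parameters and compute the same map.

```python
import torch
import torch.nn as nn

emb_dim = 16                                     # assumed embedding size
emb = nn.Embedding(2, emb_dim)
lin = nn.Linear(2, emb_dim, bias=False)
with torch.no_grad():
    lin.weight.copy_(emb.weight.t())             # tie the parameters for the comparison

flag = torch.tensor([0, 1, 1])                   # e.g. "is the atom in a ring"
one_hot = torch.nn.functional.one_hot(flag, num_classes=2).float()
assert torch.allclose(emb(flag), lin(one_hot))   # embedding lookup == one-hot @ weights
```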
GIN assumes that features are one-hot encoded, since summation is injective: "In the first iteration, we do not need MLPs before summation if input features are one-hot encodings as their summation alone is injective".
The features in the graph classification example with GIN are not one-hot encoded, however, while at least some of them easily could be (e.g. atom type). Is this on purpose, and does GIN not show any degraded performance? Also, for the same reason, shouldn't AtomEncoder actually be harmful to GIN performance due to the loss of injectivity?
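For reference, a minimal sketch of the kind of setup the question refers to (module names and dimensions are assumptions, not the exact example code): dense AtomEncoder output passed to torch_geometric's GINConv, which sums neighbor features and applies its MLP to the sum.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv
from ogb.graphproppred.mol_encoder import AtomEncoder

emb_dim = 64                                              # assumed hidden size
atom_encoder = AtomEncoder(emb_dim)
mlp = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
conv = GINConv(mlp)

x = torch.zeros(5, 9, dtype=torch.long)                   # 5 atoms, 9 dummy categorical indices
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])   # a small path graph
h = conv(atom_encoder(x), edge_index)                      # dense embeddings in, dense embeddings out
```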