Closed TShimko126 closed 1 year ago
The returned features are categorical. Simply converting them to floating point will wrongly treat them as numerical. What you need to do is input the features into an `Embedding` layer to learn representations for each category, similar to what we do here. Alternatively, I am open to letting `from_smiles` return one-hot vector representations, e.g., via `from_smiles(one_hot=True)`. Let me know if you have interest in contributing this.
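As a rough illustration of the `Embedding` approach, here is a minimal sketch that learns one embedding table per categorical column and sums the results. The class name `CategoricalEncoder`, the category counts, and the toy feature values are assumptions made for illustration; they are not taken from PyG or OGB.

```python
import torch
from torch import nn


class CategoricalEncoder(nn.Module):
    """Hypothetical encoder: one Embedding table per categorical feature column."""

    def __init__(self, num_categories, hidden_dim):
        super().__init__()
        # num_categories[i] is the assumed number of distinct values in column i.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, hidden_dim) for n in num_categories]
        )

    def forward(self, x):
        # x: Long tensor of shape [num_nodes, num_columns] holding category indices.
        out = 0
        for i, emb in enumerate(self.embeddings):
            # Look up a learned vector for each node's category in column i.
            out = out + emb(x[:, i])
        return out  # Float tensor of shape [num_nodes, hidden_dim]


# Toy stand-in for `data.x`: 3 nodes, 2 categorical columns (values are made up).
x = torch.tensor([[6, 0], [8, 1], [6, 2]], dtype=torch.long)
encoder = CategoricalEncoder(num_categories=[119, 4], hidden_dim=16)
h = encoder(x)
print(h.shape, h.dtype)
```

The output `h` is a Float tensor, so it can be passed to layers that expect floating-point input, while each category still gets its own learned representation instead of being treated as a magnitude.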
Hi Matthias - thanks for clarifying! Good to know about the AtomEncoder and BondEncoder from OGB. That should be sufficient for my purposes.
I've previously used the `MolGraphConvFeaturizer` from DeepChem, which returns one-hot encodings. It may be something I would be open to adding in the future, but unfortunately I do not have the time currently. I'll go ahead and close this issue for now.
Thanks for your help and continued work on PyTorch Geometric!
🚀 The feature, motivation and pitch
Feature
Thank you for continued development of and support for the PyG package!
I'd like to request a small change to the implementation of the `from_smiles` function in `torch_geometric.utils`. Specifically, I'd like to request a change for the dtype of the `x` and `edge_attr` features of the data object returned by the `from_smiles` function from Long to Float.
The relevant lines in the code base are line 96 for `x` and line 113 for `edge_attr`.
Motivation
Under the current implementation of the function, the node and edge features (stored in `x` and `edge_attr`, respectively) are returned with dtype Long. However, most if not all of the layers that act upon these attributes (convolution, pooling, etc.) expect the Float dtype. This makes it impossible to use the direct output of this function in the forward pass of most PyG-based models.
In the example below, I try to feed the output of `from_smiles` directly into a 1-layer GCN model.
Example
which raises the following error:
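The mismatch described above, Long features meeting Float weights, can be reproduced in miniature with plain PyTorch. This is only a sketch under stated assumptions: the tensor values are made up, and a `torch.nn.Linear` layer stands in for the weight multiplication inside a graph convolution, so it runs without RDKit or PyG installed.

```python
import torch

# Stand-in for `data.x` as returned with dtype Long (values are made up).
x = torch.tensor([[6, 0, 4], [8, 0, 2]], dtype=torch.long)

# A plain Linear layer stands in for the Float weight matrix of a conv layer.
lin = torch.nn.Linear(in_features=3, out_features=8)

try:
    lin(x)  # Long input multiplied by Float weights
    failed = False
except RuntimeError as err:
    failed = True
    # The exact message depends on the PyTorch version.
    print(f"RuntimeError: {err}")
```

The wording of the error varies between PyTorch versions, so it is printed here rather than quoted.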
Alternatives
Current alternative
Right now, the best alternative is to manually convert the dtype of the `x` and `edge_attr` attributes of the generated data object to floats, as shown below. This can, of course, be done during initial parsing of the SMILES string, as a transform in the dataset/dataloader, or on the forward pass of the model.
which returns:
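A minimal sketch of this kind of manual conversion, written as a small callable that could be applied after parsing. The helper name `to_float_features` is hypothetical, and a `SimpleNamespace` with made-up values stands in for the data object so the snippet runs with PyTorch alone.

```python
from types import SimpleNamespace

import torch


def to_float_features(data):
    """Hypothetical helper: cast node and edge features from Long to Float."""
    data.x = data.x.to(torch.float)
    data.edge_attr = data.edge_attr.to(torch.float)
    return data


# Stand-in for the object returned by `from_smiles` (values are made up).
data = SimpleNamespace(
    x=torch.tensor([[6, 0], [8, 1]], dtype=torch.long),
    edge_attr=torch.tensor([[1, 0], [1, 0]], dtype=torch.long),
)

data = to_float_features(data)
print(data.x.dtype, data.edge_attr.dtype)  # torch.float32 torch.float32
```

A callable of this shape could presumably be passed wherever a dataset accepts a transform. Note that the cast keeps the integer category codes as magnitudes rather than encoding them.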
Implementation alternative
One alternative, if it's desirable to keep the current behavior for backward compatibility, is to allow the user to specify the dtype for `x` and `edge_attr` through a `dtype` keyword, as shown in the condensed example below.
Additional context
Thank you for considering this change, and if I am missing the reason for the Long return type, please do let me know.
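The `dtype` keyword alternative described above could look roughly like the wrapper below. This is a sketch, not the library's API: `parse_with_dtype` and `fake_parser` are hypothetical names, and the stand-in parser returns made-up Long tensors so the snippet runs without RDKit or PyG. In real use, `parse_fn` would be `from_smiles`.

```python
from types import SimpleNamespace

import torch


def parse_with_dtype(parse_fn, smiles, dtype=torch.long):
    """Hypothetical wrapper: parse a SMILES string, then cast the features.

    The default of torch.long leaves the current behavior unchanged.
    """
    data = parse_fn(smiles)
    data.x = data.x.to(dtype)
    data.edge_attr = data.edge_attr.to(dtype)
    return data


def fake_parser(smiles):
    # Stand-in for `from_smiles`; ignores its input and returns made-up features.
    return SimpleNamespace(
        x=torch.tensor([[6, 0], [8, 1]], dtype=torch.long),
        edge_attr=torch.tensor([[1, 0], [1, 0]], dtype=torch.long),
    )


default = parse_with_dtype(fake_parser, "CO")
as_float = parse_with_dtype(fake_parser, "CO", dtype=torch.float)
print(default.x.dtype, as_float.x.dtype)  # torch.int64 torch.float32
```

Keeping Long as the default means existing callers see no change, while users who want Float features opt in explicitly.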