pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

The feature dim of data.x is zero in the PROTEINS dataset with PyG versions after 2.0.5 #5411

Closed chaojiewang94 closed 2 years ago

chaojiewang94 commented 2 years ago

🐛 Describe the bug

The cause appears to be line 136 of tu_dataset.py.

Strangely, the value of num_edge_attributes is larger than the feature dimension of self.data.x in PROTEINS, so the resulting self.data.x has shape num_nodes x 0.
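
To illustrate the failure mode described above, here is a minimal pure-Python sketch of what happens when a column slice starts beyond the width of the feature matrix. It is a stand-in for a tensor slice of the form x[:, k:], not the library's actual code. The helper name drop_leading_columns and the attribute values are made up for illustration; the layout of 1 attribute plus 3 one-hot labels and the offset 43471 are taken from values reported later in this thread.

```python
# Pure-Python stand-in for a column slice such as x[:, k:].
# drop_leading_columns is a hypothetical helper, not part of PyG.

def drop_leading_columns(rows, k):
    """Return each row with its first k columns removed."""
    return [row[k:] for row in rows]


# Assumed layout per node: 1 continuous attribute followed by 3 one-hot labels.
# The attribute values (22.0, 5.0) are invented for this example.
x = [
    [22.0, 1, 0, 0],
    [5.0, 0, 1, 0],
]

healthy = drop_leading_columns(x, 1)     # offset 1: the 3 label columns remain
broken = drop_leading_columns(x, 43471)  # offset far beyond the row width

print(len(healthy[0]))  # 3
print(len(broken[0]))   # 0
```

An out-of-range slice start raises no error, it just yields empty rows, which would explain why the problem surfaces only as a zero feature dimension.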

Environment

diningphil commented 2 years ago

Seconding this. I just noticed that the 3 atom types in PROTEINS are only added when use_node_attr=True is set in the constructor (but then we get an additional feature that was not there before). This is inconsistent with the behavior of other datasets like NCI1, where the atom type is always present. This change of behavior can seriously impact the reproducibility of libraries using this dataset. Please fix it asap.

rusty1s commented 2 years ago

from torch_geometric.datasets import TUDataset

dataset = TUDataset('/tmp/ENZYMES', name='ENZYMES')
print(dataset)
print(dataset.num_features)

returns 3 for me. Can you remove the processed folder and try again?

flandolfi commented 2 years ago

It's a problem affecting specifically 'PROTEINS' ('ENZYMES' is fine):

In [1]: from torch_geometric.datasets import TUDataset

In [2]: ds = TUDataset(root='/tmp/TUDataset/', name='PROTEINS')
Downloading https://www.chrsmrrs.com/graphkerneldatasets/PROTEINS.zip
Extracting /tmp/TUDataset/PROTEINS/PROTEINS.zip
Processing...
Done!

In [3]: ds.num_node_attributes
Out[3]: 43471

In [4]: ds = TUDataset(root='/tmp/TUDataset/', name='ENZYMES')
Downloading https://www.chrsmrrs.com/graphkerneldatasets/ENZYMES.zip
Extracting /tmp/TUDataset/ENZYMES/ENZYMES.zip
Processing...
Done!

In [5]: ds.num_node_attributes
Out[5]: 18

rusty1s commented 2 years ago

Ah, I see. Sorry, not sure why I tested on ENZYMES. Your PR indeed fixes this, thanks!

21721677 commented 1 year ago

This problem has appeared again in the latest version.

from torch_geometric.datasets import TUDataset

dataset = TUDataset(root="datasets", name="PROTEINS", use_node_attr=False)
print(dataset)
print(dataset.num_node_attributes)
print(dataset.num_node_labels)
print(dataset.num_node_features)

The outputs are:

PROTEINS(1113)
43471
3
0

whereas the correct values should be:

dataset.num_node_attributes=1
dataset.num_node_labels=3
dataset.num_node_features=3
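
The relationship between these three numbers can be written down as a small consistency check. This is only a sketch under the assumption, suggested by the comments in this thread, that the node features consist of the one-hot labels plus the continuous attributes when use_node_attr=True, and of the labels alone otherwise. Both function names are hypothetical and not part of PyG.

```python
# Hypothetical helpers: a sanity check on the counts reported in this thread.

def expected_num_node_features(num_node_attributes, num_node_labels, use_node_attr):
    """Labels are always kept; attributes only count when use_node_attr is True."""
    extra = num_node_attributes if use_node_attr else 0
    return num_node_labels + extra


def counts_are_consistent(num_node_attributes, num_node_labels,
                          num_node_features, use_node_attr):
    """Compare a reported feature count against the expected one."""
    expected = expected_num_node_features(
        num_node_attributes, num_node_labels, use_node_attr)
    return num_node_features == expected


# Values printed in the report above (use_node_attr=False).
print(counts_are_consistent(43471, 3, 0, use_node_attr=False))  # False

# Values the reporter expected.
print(counts_are_consistent(1, 3, 3, use_node_attr=False))  # True
```

Under the same assumption, use_node_attr=True would give 4 features for PROTEINS, matching the "additional feature" mentioned earlier in the thread.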

In addition, every data.x is wrong:

In [ ]: dataset[0].x
Out[ ]: tensor([], size=(42, 0))

rusty1s commented 1 year ago

I cannot reproduce this on the latest version. Can you remove the processed_dir and try again?
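
For anyone following along, clearing the cached files could look roughly like the sketch below. It assumes the processed files live under <root>/<name>/processed, which is an assumption about the on-disk layout rather than something confirmed in this thread; check dataset.processed_dir on your own install before deleting anything. The helper name clear_processed_cache is hypothetical.

```python
import shutil
from pathlib import Path


def clear_processed_cache(root, name, folder="processed"):
    """Delete <root>/<name>/<folder> if it exists.

    Returns True if a directory was removed, False if there was nothing to do.
    The directory layout is an assumption, not a documented guarantee.
    """
    processed = Path(root) / name / folder
    if processed.is_dir():
        shutil.rmtree(processed)
        return True
    return False
```

A typical use would be to call clear_processed_cache("datasets", "PROTEINS") and then construct TUDataset again, so that the dataset is re-processed with the currently installed version.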

21721677 commented 1 year ago

I'm sorry. This seems to be a problem local to my own environment: I ran the same code in someone else's environment and there was no problem. I haven't found the cause, though; maybe some packages are at the wrong version and conflict with PyG.