Open FeiGSSS opened 2 years ago
Specifically, I found an inconsistency: the node features in the provided cached data are not aligned with the feature_encoder:
For instance, as shown below, the charge attributes (second column) of the nodes in the first DGL graph of the BBBP dataset are all 22.
In [1]: import dgl
In [2]: bbbp = dgl.load_graphs("./BBBP.bin")[0]
In [3]: bbbp[0].ndata["feature"]
Out[3]:
tensor([[ 8, 22, 27, 30],
[17, 22, 27, 33],
[17, 22, 27, 31],
[17, 22, 27, 33],
[14, 22, 27, 31],
[17, 22, 27, 32],
[17, 22, 27, 31],
[ 6, 22, 27, 31],
[17, 22, 27, 32],
[ 6, 22, 27, 30],
[17, 22, 28, 30],
[17, 22, 28, 31],
[17, 22, 28, 31],
[17, 22, 28, 31],
[17, 22, 28, 30],
[17, 22, 28, 31],
[17, 22, 28, 31],
[17, 22, 28, 31],
[17, 22, 28, 31],
[17, 22, 28, 30]])
However, loading the feature_encoder saved with the pretrained model, such as gcn_1024/feature_enc.pkl, gives:
In [5]: import pickle as pkl
In [6]: with open("../../saved/gcn_1024/feature_enc.pkl", "rb") as f:
   ...:     feature_encoder = pkl.load(f)
   ...:
In [7]: feature_encoder
Out[7]:
{'element': {'Li': 0,
'Mn': 1,
'O': 2,
'Zr': 3,
'Cl': 4,
'Na': 5,
'In': 6,
'Cu': 7,
'Sb': 8,
'Pb': 9,
'F': 10,
'K': 11,
'B': 12,
'Ge': 13,
'N': 14,
'Hg': 15,
'As': 16,
'Zn': 17,
'Ru': 18,
'Mg': 19,
'Si': 20,
'S': 21,
'Cr': 22,
'Sn': 23,
'P': 24,
'Ta': 25,
'C': 26,
'Bi': 27,
'Pt': 28,
'Cd': 29,
'Ti': 30,
'Xe': 31,
'Al': 32,
'Br': 33,
'Se': 34,
'Ga': 35,
'Ag': 36,
'I': 37,
'unknown': 38},
'charge': {0: 39, 1: 40, 2: 41, 3: 42, 4: 43, -1: 44, 'unknown': 45},
'aromatic': {False: 46, True: 47, 'unknown': 48},
'hcount': {0: 49, 1: 50, 2: 51, 3: 52, 4: 53, 'unknown': 54}}
The values of the charge attribute start from 39 (i.e., with this encoder, all node features of BBBP shown above fall within the index range of the element attribute, 0 to 38).
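The range argument above can be checked mechanically. The following is only a minimal sketch: the helper names index_ranges and misaligned_columns are hypothetical (not part of the repository), the column order element, charge, aromatic, hcount is assumed from the printed tensors, and the element map is abbreviated to its first and last entries.

```python
# Abbreviated copy of the encoder printed above; only the first and last
# element entries are kept, which is enough to recover the index range.
FEATURE_ENCODER = {
    "element": {"Li": 0, "unknown": 38},
    "charge": {0: 39, 1: 40, 2: 41, 3: 42, 4: 43, -1: 44, "unknown": 45},
    "aromatic": {False: 46, True: 47, "unknown": 48},
    "hcount": {0: 49, 1: 50, 2: 51, 3: 52, 4: 53, "unknown": 54},
}

# Assumed column order of the node feature matrix.
FEATURE_ORDER = ("element", "charge", "aromatic", "hcount")


def index_ranges(feature_encoder):
    """Return {attribute: (lowest index, highest index)} for an encoder."""
    return {
        name: (min(mapping.values()), max(mapping.values()))
        for name, mapping in feature_encoder.items()
    }


def misaligned_columns(rows, feature_encoder, order=FEATURE_ORDER):
    """List the attributes whose column contains an out-of-range index."""
    ranges = index_ranges(feature_encoder)
    bad = []
    for col, name in enumerate(order):
        lo, hi = ranges[name]
        if any(not (lo <= row[col] <= hi) for row in rows):
            bad.append(name)
    return bad


# First two rows of the cached tensor and of the regenerated tensor.
cached_rows = [[8, 22, 27, 30], [17, 22, 27, 33]]
regenerated_rows = [[4, 39, 46, 49], [26, 39, 46, 52]]

print(misaligned_columns(cached_rows, FEATURE_ENCODER))
print(misaligned_columns(regenerated_rows, FEATURE_ENCODER))
```

For a real graph, the rows could be obtained with bbbp[0].ndata["feature"].tolist() and the encoder loaded from feature_enc.pkl instead of the abbreviated copy.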
I think this is why the AUC decreases a lot after I regenerate the node features of the BBBP dataset. Actually, the node feature matrix generated using the above feature_encoder is:
In [4]: bbbp[0].ndata["feature"]
Out[4]:
tensor([[ 4, 39, 46, 49],
[26, 39, 46, 52],
[26, 39, 46, 50],
[26, 39, 46, 52],
[14, 39, 46, 50],
[26, 39, 46, 51],
[26, 39, 46, 50],
[ 2, 39, 46, 50],
[26, 39, 46, 51],
[ 2, 39, 46, 49],
[26, 39, 47, 49],
[26, 39, 47, 50],
[26, 39, 47, 50],
[26, 39, 47, 50],
[26, 39, 47, 49],
[26, 39, 47, 50],
[26, 39, 47, 50],
[26, 39, 47, 50],
[26, 39, 47, 50],
[26, 39, 47, 49]])
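To make the mapping from atom attributes to the rows above concrete, here is a rough sketch of a per-node lookup. It is not the repository's actual code: encode_node is a hypothetical helper, the attribute order is assumed from the tensor columns, the fallback to 'unknown' is an assumption about how unseen values are handled, and the element map is abbreviated to the entries needed here.

```python
# Abbreviated copy of the encoder printed earlier (element map truncated to
# the symbols used in this example plus 'unknown').
FEATURE_ENCODER = {
    "element": {"O": 2, "Cl": 4, "N": 14, "C": 26, "unknown": 38},
    "charge": {0: 39, 1: 40, 2: 41, 3: 42, 4: 43, -1: 44, "unknown": 45},
    "aromatic": {False: 46, True: 47, "unknown": 48},
    "hcount": {0: 49, 1: 50, 2: 51, 3: 52, 4: 53, "unknown": 54},
}

# Assumed column order of the node feature matrix.
FEATURE_ORDER = ("element", "charge", "aromatic", "hcount")


def encode_node(attrs, feature_encoder, order=FEATURE_ORDER):
    """Encode one atom's attribute dict as a list of integer indices.

    Values missing from a mapping (or absent from attrs) fall back to that
    attribute's 'unknown' index; this fallback is an assumption.
    """
    row = []
    for name in order:
        mapping = feature_encoder[name]
        row.append(mapping.get(attrs.get(name), mapping["unknown"]))
    return row


# Attribute values chosen so that they map to the first two rows above.
chlorine = {"element": "Cl", "charge": 0, "aromatic": False, "hcount": 0}
methyl_carbon = {"element": "C", "charge": 0, "aromatic": False, "hcount": 3}

print(encode_node(chlorine, FEATURE_ENCODER))       # [4, 39, 46, 49]
print(encode_node(methyl_carbon, FEATURE_ENCODER))  # [26, 39, 46, 52]
```

Under this encoder the cached value 22 in the charge column could never be produced, since every charge index is at least 39.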
Hi, when I remove the cached data you provided for the property prediction datasets and regenerate it myself using your code, the AUC of property prediction decreases a lot. On the other hand, when I use the cached data you provided, the reported results can be reproduced. I've checked that the required versions of pysmiles and networkx are used.