hwwang55 / MolR

Chemical-Reaction-Aware Molecule Representation Learning
MIT License

AUC decreases A LOT after re-generating cached data #3

Open FeiGSSS opened 2 years ago

FeiGSSS commented 2 years ago

Hi, when I remove the cached data you provided for the property prediction datasets and regenerate it myself using your code, the AUC of property prediction decreases a lot. When I use the cached data you provided, on the other hand, the reported results can be reproduced. I've checked that the required versions of pysmiles and networkx are used.

FeiGSSS commented 2 years ago

Specifically, I found an inconsistency: the node features in the provided cached data are not aligned with the feature_encoder. For instance, as shown below, the charge attributes (the second column) of all nodes in the first DGL graph of the BBBP dataset are 22.

In [1]: import dgl

In [2]: bbbp = dgl.load_graphs("./BBBP.bin")[0]

In [3]: bbbp[0].ndata["feature"]
Out[3]: 
tensor([[ 8, 22, 27, 30],
        [17, 22, 27, 33],
        [17, 22, 27, 31],
        [17, 22, 27, 33],
        [14, 22, 27, 31],
        [17, 22, 27, 32],
        [17, 22, 27, 31],
        [ 6, 22, 27, 31],
        [17, 22, 27, 32],
        [ 6, 22, 27, 30],
        [17, 22, 28, 30],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 30],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 31],
        [17, 22, 28, 30]])

However, loading the feature_encoder saved with the pretrained model, e.g. gcn_1024/feature_enc.pkl, gives:

In [6]: with open("../../saved/gcn_1024/feature_enc.pkl", "rb") as f:
   ...:     feature_encoder = pkl.load(f)
   ...: 

In [7]: feature_encoder
Out[7]: 
{'element': {'Li': 0,
  'Mn': 1,
  'O': 2,
  'Zr': 3,
  'Cl': 4,
  'Na': 5,
  'In': 6,
  'Cu': 7,
  'Sb': 8,
  'Pb': 9,
  'F': 10,
  'K': 11,
  'B': 12,
  'Ge': 13,
  'N': 14,
  'Hg': 15,
  'As': 16,
  'Zn': 17,
  'Ru': 18,
  'Mg': 19,
  'Si': 20,
  'S': 21,
  'Cr': 22,
  'Sn': 23,
  'P': 24,
  'Ta': 25,
  'C': 26,
  'Bi': 27,
  'Pt': 28,
  'Cd': 29,
  'Ti': 30,
  'Xe': 31,
  'Al': 32,
  'Br': 33,
  'Se': 34,
  'Ga': 35,
  'Ag': 36,
  'I': 37,
  'unknown': 38},
 'charge': {0: 39, 1: 40, 2: 41, 3: 42, 4: 43, -1: 44, 'unknown': 45},
 'aromatic': {False: 46, True: 47, 'unknown': 48},
 'hcount': {0: 49, 1: 50, 2: 51, 3: 52, 4: 53, 'unknown': 54}}

With this encoder, the charge indices start at 39, so every node feature value in the cached BBBP graph above falls in the element range (0 to 38).
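To make the mismatch concrete, here is a minimal sketch that inverts the encoder printed above and decodes one feature row. The encoder dict is retyped from the `Out[7]` listing; `build_decoder` and `decode_row` are hypothetical helper names for illustration, not functions from the MolR code base.

```python
# Element order retyped from the feature_encoder listing above (index = position).
ELEMENTS = ("Li Mn O Zr Cl Na In Cu Sb Pb F K B Ge N Hg As Zn Ru Mg Si S Cr Sn P "
            "Ta C Bi Pt Cd Ti Xe Al Br Se Ga Ag I unknown").split()

feature_encoder = {
    "element": {name: i for i, name in enumerate(ELEMENTS)},
    "charge": {0: 39, 1: 40, 2: 41, 3: 42, 4: 43, -1: 44, "unknown": 45},
    "aromatic": {False: 46, True: 47, "unknown": 48},
    "hcount": {0: 49, 1: 50, 2: 51, 3: 52, 4: 53, "unknown": 54},
}


def build_decoder(encoder):
    """Invert the nested encoder: global index -> (field, raw value)."""
    return {
        idx: (field, value)
        for field, mapping in encoder.items()
        for value, idx in mapping.items()
    }


def decode_row(row, decoder):
    """Decode one node's feature indices; unknown indices are flagged."""
    return [decoder.get(idx, ("out_of_range", idx)) for idx in row]


decoder = build_decoder(feature_encoder)

cached_row = [8, 22, 27, 30]       # first node of the cached BBBP graph
regenerated_row = [4, 39, 46, 49]  # same node after regenerating with this encoder

print(decode_row(cached_row, decoder))
print(decode_row(regenerated_row, decoder))
```

Under this encoder the cached row decodes to four element entries (Sb, Cr, Bi, Ti), while the regenerated row decodes to one entry per field (element Cl, charge 0, aromatic False, hcount 0), which is what a well-formed node feature should look like.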
I think this is why the AUC decreases a lot after I regenerate the node features of the BBBP dataset. The node feature matrix generated using the above feature_encoder is:

In [4]: bbbp[0].ndata["feature"]
Out[4]: 
tensor([[ 4, 39, 46, 49],
        [26, 39, 46, 52],
        [26, 39, 46, 50],
        [26, 39, 46, 52],
        [14, 39, 46, 50],
        [26, 39, 46, 51],
        [26, 39, 46, 50],
        [ 2, 39, 46, 50],
        [26, 39, 46, 51],
        [ 2, 39, 46, 49],
        [26, 39, 47, 49],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 49],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 50],
        [26, 39, 47, 49]])
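One way to check whether the two matrices describe the same graph under different vocabularies is to test for a consistent one-to-one mapping between cached and regenerated indices. The sketch below does this on six rows copied from the two outputs above (rows 0, 1, 4, 5, 7 and 10); `infer_remap` is a hypothetical helper, and with real tensors one would first call `.tolist()` on `ndata["feature"]`.

```python
def infer_remap(old_rows, new_rows):
    """Return the old->new index map if it is consistent and one-to-one, else None."""
    forward, backward = {}, {}
    for old_row, new_row in zip(old_rows, new_rows):
        for old, new in zip(old_row, new_row):
            # setdefault records the first pairing seen and returns it afterwards,
            # so any later disagreement shows up as a mismatch.
            if forward.setdefault(old, new) != new:
                return None
            if backward.setdefault(new, old) != old:
                return None
    return forward


# Rows 0, 1, 4, 5, 7, 10 of the cached matrix (Out[3]).
cached = [
    [8, 22, 27, 30],
    [17, 22, 27, 33],
    [14, 22, 27, 31],
    [17, 22, 27, 32],
    [6, 22, 27, 31],
    [17, 22, 28, 30],
]

# The same rows of the regenerated matrix (Out[4]).
regenerated = [
    [4, 39, 46, 49],
    [26, 39, 46, 52],
    [14, 39, 46, 50],
    [26, 39, 46, 51],
    [2, 39, 46, 50],
    [26, 39, 47, 49],
]

remap = infer_remap(cached, regenerated)
print(remap)
```

For these rows a consistent mapping exists (for example 22 maps to 39 and 17 maps to 26). That would be consistent with the cached file having been built from the same molecules with a differently ordered, smaller vocabulary, although a handful of rows cannot confirm which encoder actually produced it.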