hwwang55 / MolR

Chemical-Reaction-Aware Molecule Representation Learning
MIT License

Why replace [H] with '' in downstream datasets? #2

Closed FeiGSSS closed 2 years ago

FeiGSSS commented 2 years ago

Hi, thanks for your contribution! I'm new to MRL and wondering why you remove [H] and ([H]) from the SMILES strings of the downstream datasets but not from the pretraining dataset.

hwwang55 commented 2 years ago

Hi there, thanks for your interest in our work! This is briefly explained in https://github.com/hwwang55/MolR/blob/master/src/property_pred/pp_data_processing.py, line 44. If a SMILES string contains [H] or ([H]), PySmiles first treats this hydrogen atom as an explicit node in the NetworkX graph and assigns it an integer index, then removes the node from the graph because we specified explicit_hydrogen=False in read_smiles(). This leaves the node indices of the final graph discontinuous (e.g., the node indices are 0, 1, 2, 3, 5, while 4 has been removed). NetworkX is fine with discontinuous node indices, but they are a problem for DGL: DGL assumes that node indices are contiguous and start from 0, so it assumes the graph has an isolated node with index 4. As a result, DGL reports the size of this graph as 6 when it should be 5. This is why [H] and ([H]) should be removed.

I didn't remove [H] or ([H]) in the pretraining dataset because USPTO does not contain such representations.
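To make the index gap concrete, here is a minimal pure-Python sketch of the two ideas above. It is not the repository's code: `strip_explicit_hydrogens` and `inferred_num_nodes` are hypothetical helper names, and the "max index + 1" rule is only a simplified model of how a library that expects contiguous indices would size the graph.

```python
def strip_explicit_hydrogens(smiles: str) -> str:
    """Drop explicit hydrogen tokens from a SMILES string (hypothetical helper)."""
    # Remove the branch form "([H])" first; otherwise removing "[H]" alone
    # would leave an empty branch "()" behind.
    return smiles.replace("([H])", "").replace("[H]", "")


def inferred_num_nodes(node_ids) -> int:
    """Simplified model: contiguous indexing from 0 implies max index + 1 nodes."""
    return max(node_ids) + 1


# Node indices left after the hydrogen with index 4 was removed from the graph.
remaining_ids = [0, 1, 2, 3, 5]

actual = len(remaining_ids)                  # nodes that really exist
assumed = inferred_num_nodes(remaining_ids)  # nodes a contiguous-index library would assume
print(actual, assumed)

# Stripping the tokens up front avoids creating the gap in the first place.
print(strip_explicit_hydrogens("C([H])O"))
```

Stripping the tokens before parsing means the hydrogen never receives an index, so there is nothing to renumber afterwards.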

FeiGSSS commented 2 years ago

Thanks a lot. The "discontinuous" problem of the NetworkX graph can be solved with the nx.relabel_nodes() function, which can manually assign a contiguous node labeling to an nx.Graph. But it seems unnecessary in this project because there are no isolated nodes in the pretraining dataset.
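A rough sketch of that relabeling idea, written in plain Python so it runs without any graph library. The helper name `relabel_contiguous` and the example edge list are assumptions made for illustration only.

```python
def relabel_contiguous(nodes, edges):
    """Map arbitrary integer node IDs onto 0..n-1, preserving their order."""
    # Old ID -> new contiguous ID, e.g. {0: 0, 1: 1, 2: 2, 3: 3, 5: 4}.
    mapping = {old: new for new, old in enumerate(sorted(nodes))}
    # Rewrite every edge endpoint using the new IDs.
    new_edges = [(mapping[u], mapping[v]) for u, v in edges]
    return mapping, new_edges


# Hypothetical graph whose node 4 (an explicit hydrogen) was removed.
nodes = [0, 1, 2, 3, 5]
edges = [(0, 1), (1, 2), (2, 3), (3, 5)]

mapping, new_edges = relabel_contiguous(nodes, edges)
print(mapping)
print(new_edges)
```

With NetworkX, a mapping of this shape could be passed to nx.relabel_nodes(G, mapping) to get the same effect on an actual graph.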