Some Questions about amap

zengkaipeng commented 7 months ago

Could you tell me how you determine the atom map number of each atom (for example, which atom should be first and which should be last)? The provided data seems to order the atom map numbers differently from other baselines trained on USPTO-50K, such as https://github.com/uta-smile/RetroXpert. I want to know whether any information leaks through the atom map numbers if I use the provided atom map order to train a model such as a transformer, which is not permutation-invariant with respect to the given atoms.

zengkaipeng commented 7 months ago

Thanks in advance!

zengkaipeng commented 7 months ago

I just found that training a transformer on the raw data, without canonicalizing the SMILES, gives impressive performance, but I cannot figure out where the problem is.

vsomnath commented 7 months ago

The products are first canonicalized (following the script here), the atom mapping is then set to the canonical order, and the reactants are remapped after this step.
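A minimal sketch of the remapping idea, not the repository's actual script: it assumes the product SMILES has already been canonicalized (for example with RDKit) while still carrying its original map numbers, and it uses the order in which mapped atoms appear in that string as a stand-in for the canonical order. The helper names `map_numbers_in_order`, `renumber`, and `remap_reaction` are hypothetical.

```python
import re

# Matches the atom-map suffix of a bracket atom, e.g. the ":3]" in "[CH3:3]".
_MAP_RE = re.compile(r":(\d+)\]")


def map_numbers_in_order(smiles):
    """Return the atom-map numbers in the order they appear in the string."""
    return [int(m) for m in _MAP_RE.findall(smiles)]


def renumber(smiles, old_to_new):
    """Rewrite every atom-map number using the old -> new lookup."""
    return _MAP_RE.sub(lambda m: f":{old_to_new[int(m.group(1))]}]", smiles)


def remap_reaction(reactants, canonical_product):
    """Renumber product atoms 1..N by position, then reuse that lookup for reactants.

    Assumption: `canonical_product` is already canonical, so string position
    approximates canonical atom order. Reactant atoms that do not occur in the
    product receive the next unused numbers, in order of appearance.
    """
    old_to_new = {}
    ordered = map_numbers_in_order(canonical_product) + map_numbers_in_order(reactants)
    for old in ordered:
        old_to_new.setdefault(old, len(old_to_new) + 1)
    return renumber(reactants, old_to_new), renumber(canonical_product, old_to_new)


reactants, product = remap_reaction(
    "[CH3:5][C:2](=[O:7])[Cl:9].[OH2:1]",
    "[CH3:5][C:2](=[O:7])[OH:1]",
)
print(product)    # product atoms are now numbered by position
print(reactants)  # reactants follow the same lookup
```

Because the lookup is built from the product first, the new numbers no longer depend on how the original dataset happened to number the atoms.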

The information leakage is not a problem here: I tested the edit prediction performance on both the old data (with leakage) and the new data (post canonicalization), and the results were the same. The predicted edits should be invariant to the atom order, which the tests confirmed.
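For an order-sensitive model such as a transformer, one rough way to probe a dataset for this kind of leak is to check how often the lowest-numbered atom falls in the reaction center, compared with what chance would give. The sketch below is illustrative only and is not the test described above; the function name `lowest_map_leak` and the input format (a list of product map numbers paired with the set of reaction-center map numbers) are assumptions.

```python
def lowest_map_leak(reactions):
    """Compare how often the lowest map number is a reaction-center atom
    against the rate expected if map numbers carried no information.

    reactions: iterable of (product_map_numbers, center_map_numbers) pairs,
    a hypothetical format chosen for this sketch.
    Returns (observed_rate, chance_rate).
    """
    observed = 0
    chance = 0.0
    total = 0
    for product_maps, center_maps in reactions:
        if not product_maps:
            continue  # skip reactions without mapped product atoms
        total += 1
        if min(product_maps) in center_maps:
            observed += 1
        # Probability that a uniformly random product atom is in the center.
        in_product = [m for m in product_maps if m in center_maps]
        chance += len(in_product) / len(product_maps)
    if total == 0:
        return 0.0, 0.0
    return observed / total, chance / total


toy = [
    ([1, 2, 3, 4], {1, 2}),
    ([1, 2, 3, 4], {3, 4}),
    ([1, 2], {1}),
]
observed_rate, chance_rate = lowest_map_leak(toy)
print(observed_rate, chance_rate)
```

An observed rate far above the chance rate would suggest that the numbering itself hints at the reaction center; similar rates suggest it does not.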

In the synthon completion step, though, how you do the atom mapping determines the order in which the fragments are generated, so you should try to keep the same atom mapping routine for training and testing.

vsomnath / graphretro

Some Questions about amap #11