Duplicated atom mapping after "canonicalize_prod.py"

otori-bird commented 2 years ago

i followed the readme and setup the environment successfully, and i wanted to apply the code to my own dataset. However some errors occured for the preprocessing process.

One of my reactions with the atom mapping is "[F:1][c:2]1c:3[cH:4]c:5 c:6[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5 c:6[cH:7]1". After running "canonicalize_prod.py", the remapped reaction is "[F:1][c:9]1[cH:10]c:11 c:13[cH:15][c:16]1N+:17[O-:19].[N:1]#[C:2][c:3]1[cH:4][cH:5][s:6][c:7]1[NH2:8]>>[N:1]#[C:2][c:3]1[cH:4][cH:5][s:6][c:7]1[NH:8][c:9]1[cH:10]c:11 c:13[cH:15][c:16]1N+:17[O-:19]", where there are two atoms with the same atom mapping in the reactants. And then i inputted it to the "parse_info,py", it was failed to be extract reaction info. I guess it was caused by the duplicated atom mapping.

There are a lot of similar errors in my own dataset, and if i skip the "canonicalize_prod.py", most of them could be processed normally by "parse_info.py".

could you fix this error? or could i skip the "canonicalize_prod.py" without affecting the performance?

Any help would be greatly appreciated.

vsomnath commented 2 years ago

Thank you for providing that example, it makes sense - I didn't check for duplicates after remapping (didn't need to with the USPTO-50k), and a fix is needed to account for this, and remap the duplicates differently. The fix itself should not take time, but I am time squeezed for the next couple of weeks.

The edit prediction step should be invariant to the application of canonicalize_prod.py. The order of fragments however is dependent on canonicalization, which influences leaving group completion performance especially if you have no canonicalization during training and canonicalization during testing (the latter is standard practice during evaluation).

You can skip the canonicalize_prod.py step if you don't change the order across training and testing, I think.

otori-bird commented 2 years ago

Thank you for providing that example, it makes sense - I didn't check for duplicates after remapping (didn't need to with the USPTO-50k), and a fix is needed to account for this, and remap the duplicates differently. The fix itself should not take time, but I am time squeezed for the next couple of weeks.

The edit prediction step should be invariant to the application of canonicalize_prod.py. The order of fragments however is dependent on canonicalization, which influences leaving group completion performance especially if you have no canonicalization during training and canonicalization during testing (the latter is standard practice during evaluation).

You can skip the canonicalize_prod.py step if you don't change the order across training and testing, I think.

Thanks a lot for your quick reply.

I'll try skipping the 'canonicalize_prod.py' and see how it works. Looking forward to your fix.

vsomnath commented 2 years ago

I tried to write a fix for the code, but the example you provided fails to give me a molecule with the following error - SMILES Parse Error: unclosed ring for input: '[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1'

Can you provide a different example so I can check for correctness?

Edit: Script for reference

from rdkit import Chem
rxn_smi = "[F:1][c:2]1c:3[cH:4]c:5c:6[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1"
r, p = rxn_smi.split(">>")
pmol = Chem.MolFromSmiles(p)

otori-bird commented 2 years ago

I tried to write a fix for the code, but the example you provided fails to give me a molecule with the following error - SMILES Parse Error: unclosed ring for input: '[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1'

Can you provide a different example so I can check for correctness?

Edit: Script for reference
from rdkit import Chem
rxn_smi = "[F:1][c:2]1c:3[cH:4]c:5c:6[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1"
r, p = rxn_smi.split(">>")
pmol = Chem.MolFromSmiles(p)

oh, it seems that my reaciton was parsed incorrectly by the markdown syntax. sorry for not discovering it earlier. The plain text of the reaction is

[F:1][c:2]1[c:3]([N+:10](=[O:11])[O-:12])[cH:4][c:5]([F:9])[c:6]([F:8])[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])[c:3]([N+:10](=[O:11])[O-:12])[cH:4][c:5]([F:9])[c:6]([F:8])[cH:7]1

vsomnath commented 2 years ago

Perfect, thank you. Let me try with this!

vsomnath commented 2 years ago

I have pushed a fix, please try it out and let me know if there are any issues!

otori-bird commented 2 years ago

I have pushed a fix, please try it out and let me know if there are any issues!

thanks a lot! i will try it and give feedback soon

vsomnath commented 2 years ago

Sorry, there was an issue with variable naming, it is fixed now

vsomnath commented 2 years ago

Closing this issue now, feel free to reopen if necessary!

vsomnath / graphretro

Duplicated atom mapping after "canonicalize_prod.py" #2