Closed otori-bird closed 2 years ago
Thank you for providing that example, it makes sense - I didn't check for duplicates after remapping (didn't need to with the USPTO-50k), and a fix is needed to account for this, and remap the duplicates differently. The fix itself should not take time, but I am time squeezed for the next couple of weeks.
The edit prediction step should be invariant to the application of canonicalize_prod.py
.
The order of fragments however is dependent on canonicalization, which influences leaving group completion performance especially if you have no canonicalization during training and canonicalization during testing (the latter is standard practice during evaluation).
You can skip the canonicalize_prod.py
step if you don't change the order across training and testing, I think.
Thank you for providing that example, it makes sense - I didn't check for duplicates after remapping (didn't need to with the USPTO-50k), and a fix is needed to account for this, and remap the duplicates differently. The fix itself should not take time, but I am time squeezed for the next couple of weeks.
The edit prediction step should be invariant to the application of
canonicalize_prod.py
. The order of fragments however is dependent on canonicalization, which influences leaving group completion performance especially if you have no canonicalization during training and canonicalization during testing (the latter is standard practice during evaluation).You can skip the
canonicalize_prod.py
step if you don't change the order across training and testing, I think.
Thanks a lot for your quick reply.
I'll try skipping the 'canonicalize_prod.py' and see how it works. Looking forward to your fix.
I tried to write a fix for the code, but the example you provided fails to give me a molecule with the following error
- SMILES Parse Error: unclosed ring for input: '[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1'
Can you provide a different example so I can check for correctness?
Edit: Script for reference
from rdkit import Chem
rxn_smi = "[F:1][c:2]1c:3[cH:4]c:5c:6[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1"
r, p = rxn_smi.split(">>")
pmol = Chem.MolFromSmiles(p)
I tried to write a fix for the code, but the example you provided fails to give me a molecule with the following error
- SMILES Parse Error: unclosed ring for input: '[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1'
Can you provide a different example so I can check for correctness?
Edit: Script for reference
from rdkit import Chem rxn_smi = "[F:1][c:2]1c:3[cH:4]c:5c:6[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1" r, p = rxn_smi.split(">>") pmol = Chem.MolFromSmiles(p)
oh, it seems that my reaciton was parsed incorrectly by the markdown syntax. sorry for not discovering it earlier. The plain text of the reaction is
[F:1][c:2]1[c:3]([N+:10](=[O:11])[O-:12])[cH:4][c:5]([F:9])[c:6]([F:8])[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])[c:3]([N+:10](=[O:11])[O-:12])[cH:4][c:5]([F:9])[c:6]([F:8])[cH:7]1
Perfect, thank you. Let me try with this!
I have pushed a fix, please try it out and let me know if there are any issues!
I have pushed a fix, please try it out and let me know if there are any issues!
thanks a lot! i will try it and give feedback soon
Sorry, there was an issue with variable naming, it is fixed now
Closing this issue now, feel free to reopen if necessary!
i followed the readme and setup the environment successfully, and i wanted to apply the code to my own dataset. However some errors occured for the preprocessing process.
One of my reactions with the atom mapping is "[F:1][c:2]1c:3[cH:4]c:5c:6[cH:7]1.[NH2:13][c:14]1[s:15][cH:16][cH:17][c:18]1[C:19]#[N:20]>>[c:2]1([NH:13][c:14]2[s:15][cH:16][cH:17][c:18]2[C:19]#[N:20])c:3[cH:4]c:5c:6[cH:7]1". After running "canonicalize_prod.py", the remapped reaction is "[F:1][c:9]1[cH:10]c:11c:13[cH:15][c:16]1N+:17[O-:19].[N:1]#[C:2][c:3]1[cH:4][cH:5][s:6][c:7]1[NH2:8]>>[N:1]#[C:2][c:3]1[cH:4][cH:5][s:6][c:7]1[NH:8][c:9]1[cH:10]c:11c:13[cH:15][c:16]1N+:17[O-:19]", where there are two atoms with the same atom mapping in the reactants. And then i inputted it to the "parse_info,py", it was failed to be extract reaction info. I guess it was caused by the duplicated atom mapping.
There are a lot of similar errors in my own dataset, and if i skip the "canonicalize_prod.py", most of them could be processed normally by "parse_info.py".
could you fix this error? or could i skip the "canonicalize_prod.py" without affecting the performance?
Any help would be greatly appreciated.