rxn4chemistry / rxnmapper

RXNMapper: Unsupervised attention-guided atom-mapping. Code complementing our Science Advances publication on "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" (https://advances.sciencemag.org/content/7/15/eabe4166).
http://rxnmapper.ai
MIT License

Bug fixes and improvements #24

Closed avaucher closed 2 years ago

avaucher commented 2 years ago

Main changes:

In general, the order of the compounds in the mapped reaction SMILES may differ from what it was before this PR, with identical confidence. Sometimes the order of the fragments also changes (see the example below). In that case the input to the transformer model is different, so the output and the confidences may be different as well.
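To check whether two reaction SMILES differ only in the order of their compounds, one option is to compare them as multisets. The following is a minimal sketch, not part of rxnmapper: the helper names are hypothetical, and it assumes the plain `precursors>>products` format with compounds separated by `.` and no difference in how each individual compound is written.

```python
from collections import Counter


def compound_multisets(rxn_smiles):
    """Split 'precursors>>products' into order-independent multisets of compounds."""
    precursors, products = rxn_smiles.split(">>")
    return Counter(precursors.split(".")), Counter(products.split("."))


def same_up_to_compound_order(rxn_a, rxn_b):
    """True if both reactions contain the same compounds on each side, in any order."""
    return compound_multisets(rxn_a) == compound_multisets(rxn_b)


# Third reaction from the list below, with the precursors written in two different orders.
rxn_a = "C1CN2CCN1CC2.CC(C)=O.ClCBr>>ClC[N+]12CCN(CC1)CC2"
rxn_b = "ClCBr.C1CN2CCN1CC2.CC(C)=O>>ClC[N+]12CCN(CC1)CC2"

print(same_up_to_compound_order(rxn_a, rxn_b))  # True
```

This string-level check does not detect a reordering of fragments inside a compound or a change in atom-map numbers; those need a chemistry-aware comparison.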

Differences on a subset of 10 reactions from USPTO

As an example, taking the first 10 reactions of the test set from https://github.com/rxn4chemistry/OpenNMT-py/tree/carbohydrate_transformer/data/uspto_dataset:

rxn
CCCI.CN(C)C=O.O=C([O-])[O-]~[K+]~[K+].c1ccc2[nH]cnc2c1>>CCCn1cnc2ccccc21
CC(C)(C)P(Cl)C(C)(C)C.ClCCl.O>>CC(C)(C)[PH](=O)C(C)(C)C
C1CN2CCN1CC2.CC(C)=O.ClCBr>>ClC[N+]12CCN(CC1)CC2
CC(C)(C)[O-]~[Na+].CCn1c(Br)nc2ccccc21.CNc1ccccn1.Cc1ccccc1>>CCn1c(N(C)c2ccccn2)nc2ccccc21
CS(C)=O.Cc1cccc(CBr)n1.O.[K+]~[OH-].c1ccc(Nc2ccccn2)nc1>>Cc1cccc(CN(c2ccccn2)c2ccccn2)n1
Brc1ccccn1.Cc1cccc(Nc2cccc3ccc(C)nc23)n1.O.O=C([O-])[O-]~[Na+]~[Na+].[Br-]~[K+].[Cu]>>Cc1cccc(N(c2ccccn2)c2cccc3ccc(C)nc23)n1
CC(=O)Oc1ccc(-c2cnc(N(S(=O)(=O)c3ccc([N+](=O)[O-])cc3)S(=O)(=O)c3ccc([N+](=O)[O-])cc3)c(Cc3ccccc3)n2)cc1.CO.[Na+]~[OH-]>>O=[N+]([O-])c1ccc(S(=O)(=O)Nc2ncc(-c3ccc(O)cc3)nc2Cc2ccccc2)cc1
CCOC(=O)CCCCCBr.CN1CCc2c([nH]c3ccccc23)C1.[H-]~[Na+]>>CCOC(=O)CCCCCn1c2c(c3ccccc31)CCN(C)C2
CCOC(C)=O.CN(C)C=O.CO.COC(=O)CCn1c2ccccc2c2ccccc21.C[O-]~[Na+].Cl~NO.O.O=C([O-])O~[Na+]>>O=C(CCn1c2ccccc2c2ccccc21)NO
CC#N.Nc1ccc(C(F)(C(F)(F)F)C(F)(F)C(F)(F)F)cc1.O=C1CCC(=O)N1Cl>>Nc1ccc(C(F)(C(F)(F)F)C(F)(F)C(F)(F)F)cc1Cl

Before the PR, the reaction containing [H-]~[Na+] failed. After the PR, all ten reactions succeed.

On the nine remaining reactions, some of the predicted confidences differ, because some fragments are ordered differently in the final reaction SMILES after canonicalization. The PR canonicalizes the species written with the dot, whereas before the PR the canonicalization was done on the form with the tilde:

>>> from rdkit.Chem import MolFromSmiles, MolToSmiles
>>> MolToSmiles(MolFromSmiles('[OH-]~[K+]'))
'[OH-]~[K+]'
>>> MolToSmiles(MolFromSmiles('[OH-].[K+]'))
'[K+].[OH-]'

The effect on the nine remaining reactions is the following:

  1. Mapped RXN SMILES: all are equivalent according to process_reaction_with_product_maps_atoms.
  2. Confidences (left: old, right: new):
    0.9785982166 0.9785982166
    0.2140628880 0.2140628880
    0.3482984899 0.3482984899
    0.9714398526 0.9714398526
    0.7309175528 0.6944778058
    0.9868994112 0.9866684536
    0.2510597188 0.2471753442
    0.8377841303 0.8302324685
    0.0532848520 0.0532848520

    (i.e. the 5th, 6th, 7th and 8th are different, due to the ordering of fragments mentioned above).
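The rows that changed can also be picked out programmatically. This small sketch uses only the values listed above; the variable names and the tolerance are illustrative choices.

```python
# Confidences for the nine reactions, copied from the list above.
old = [0.9785982166, 0.2140628880, 0.3482984899, 0.9714398526, 0.7309175528,
       0.9868994112, 0.2510597188, 0.8377841303, 0.0532848520]
new = [0.9785982166, 0.2140628880, 0.3482984899, 0.9714398526, 0.6944778058,
       0.9866684536, 0.2471753442, 0.8302324685, 0.0532848520]

# 1-based positions where the confidence changed beyond a small tolerance.
changed = [i for i, (o, n) in enumerate(zip(old, new), start=1) if abs(o - n) > 1e-9]
print(changed)  # [5, 6, 7, 8]

# Signed change (new minus old) for each affected row.
for i in changed:
    print(i, f"{new[i - 1] - old[i - 1]:+.10f}")
```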

If the ~ are replaced by . for those nine reactions, the results from before and after the PR have an equivalent mapping and identical confidences (but the ordering of the mapped reaction SMILES may be different).
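The substitution described above is a plain string replacement. A minimal sketch, assuming that `~` appears in these reaction SMILES only as the separator between fragments of one compound (the helper name is hypothetical):

```python
def tildes_to_dots(rxn_smiles):
    """Replace the '~' fragment separator with the standard '.' separator."""
    return rxn_smiles.replace("~", ".")


# Fifth reaction from the list above.
rxn = ("CS(C)=O.Cc1cccc(CBr)n1.O.[K+]~[OH-].c1ccc(Nc2ccccn2)nc1"
       ">>Cc1cccc(CN(c2ccccn2)c2ccccn2)n1")
converted = tildes_to_dots(rxn)
print(converted)

# With '~', the salt counts as one compound; with '.', its ions are separate entries.
print(len(rxn.split(">>")[0].split(".")))        # 5
print(len(converted.split(">>")[0].split(".")))  # 6
```

After the replacement, the information about which ions belong to the same compound is no longer present in the string.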