RXNMapper: Unsupervised attention-guided atom-mapping. Code complementing our Science Advances publication on "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" (https://advances.sciencemag.org/content/7/15/eabe4166).
Main changes:
- Rely on `rxn-chemutils`, which makes several of the points below possible.
- Support for actual fragment bonds. Before this PR, `~` was sometimes accepted, but only when the resulting compounds could be parsed by RDKit. Inputs such as `[Na+]~[H-]` raised an Invalid Valence error; this PR fixes that.
- Improved support for the extended SMILES notation. There was already some code to handle reaction SMILES with `|f:0.1,2.3|`, but such inputs usually failed. They now work, and the mapped reaction SMILES is returned in the same format when the input uses it.
- Specifying `canonicalize_rxns=False` now changes the input SMILES as little as possible. In particular, it keeps the original ordering of the atoms more consistently (but not in 100% of cases).
- Specifying `canonicalize_rxns=False` no longer fails for SMILES with invalid valence, such as `CFC`.
- Additional tests.
- Other minor fixes.
In general, the ordering of the compounds in the mapped reaction SMILES may differ from the ordering before this PR (with identical confidence).
Sometimes the order of the fragments also changes (see the example below); in this case, the input to the transformer model is different, and hence the output and confidences may differ as well.
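To make the two notations mentioned above concrete, here is a minimal sketch of how a reaction SMILES with `~` fragment bonds relates to the extended form with a `|f:...|` suffix. It is plain string handling written for illustration: the function name `tilde_to_extended` is hypothetical, it is not the `rxn-chemutils` implementation, and it assumes that fragment indices are counted from zero across reactants, agents, and products in order.

```python
def tilde_to_extended(rxn_smiles: str) -> str:
    """Rewrite 'A~B.C>>D' as 'A.B.C>>D |f:0.1|' (illustrative sketch only)."""
    sides = rxn_smiles.split(">")
    if len(sides) != 3:
        raise ValueError("expected 'reactants>agents>products'")

    groups = []      # one entry per multi-fragment compound, e.g. "0.1"
    new_sides = []   # reactants, agents, products with '~' replaced by '.'
    index = 0        # running fragment index over the whole reaction

    for side in sides:
        side_fragments = []
        # An empty side (e.g. no agents) yields no compounds.
        for compound in filter(None, side.split(".")):
            fragments = compound.split("~")
            if len(fragments) > 1:
                indices = range(index, index + len(fragments))
                groups.append(".".join(str(i) for i in indices))
            side_fragments.extend(fragments)
            index += len(fragments)
        new_sides.append(".".join(side_fragments))

    plain = ">".join(new_sides)
    return f"{plain} |f:{','.join(groups)}|" if groups else plain


print(tilde_to_extended("[Na+]~[H-].CCO>>CC[O-]~[Na+]"))
```

The sketch never parses the molecules, so it does not depend on whether a fragment such as `[H-]` has a valid valence on its own.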
Differences on a subset of 10 reactions from USPTO
As an example, consider the first 10 reactions of the test set from https://github.com/rxn4chemistry/OpenNMT-py/tree/carbohydrate_transformer/data/uspto_dataset.
Before the PR, this failed because one of the reactions contains `[H-]~[Na+]`. After the PR, all reactions are mapped successfully.
On the nine remaining reactions, there are some differences in the predicted confidences. This is because some of the fragments are ordered differently in the final reaction SMILES after canonicalization: the PR canonicalizes the species written with the dot, whereas the canonicalization used the tilde before the PR.
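As a toy illustration of why the separator can matter, the sketch below orders compounds by their string form. Lexicographic sorting is only an assumed stand-in for the real canonical ordering, which is not described here, and `order_compounds` is a hypothetical helper. Because `.` sorts before letters and `~` sorts after them, the same two compounds can end up in a different order.

```python
def order_compounds(compounds, fragment_separator):
    """Join each compound's fragments, then sort the resulting strings.

    Assumption: plain lexicographic sorting stands in for the actual
    canonical ordering used by the library.
    """
    return sorted(fragment_separator.join(fragments) for fragments in compounds)


# Ethylamine hydrochloride (two fragments) and N-methylethylamine (one fragment).
compounds = [["CCN", "Cl"], ["CCNC"]]

with_tilde = order_compounds(compounds, "~")  # multi-fragment compound sorts last
with_dot = order_compounds(compounds, ".")    # multi-fragment compound sorts first

print(with_tilde)
print(with_dot)
```

A different compound order changes the token sequence seen by the model, which is consistent with the small confidence differences reported here.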
The effect on the nine remaining reactions is the following:
`process_reaction_with_product_maps_atoms`.
(I.e., the 5th, 6th, and 7th are different, due to the ordering of fragments mentioned above.)
If the `~` are replaced by `.` for those nine reactions, the results from before and after the PR have an equivalent mapping and identical confidences (but the ordering of the mapped reaction SMILES may be different).
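A check of this kind can be sketched with plain string handling. Both helper names are hypothetical, the mapped reaction SMILES in the usage lines are made-up inputs rather than actual RXNMapper output, and the comparison assumes that both results use the same atom-map numbers, so it only ignores the order of the compounds on each side.

```python
from collections import Counter


def replace_fragment_bonds(rxn_smiles: str) -> str:
    """Turn '~' fragment bonds into ordinary '.' separators."""
    return rxn_smiles.replace("~", ".")


def same_up_to_compound_order(rxn_a: str, rxn_b: str) -> bool:
    """Compare two reaction SMILES, ignoring compound order on each side.

    Assumption: both strings use identical atom-map numbers; mappings that
    are equivalent but numbered differently are not detected.
    """
    def normalize(rxn):
        return [Counter(side.split(".")) for side in rxn.split(">")]

    return normalize(rxn_a) == normalize(rxn_b)


# Made-up mapped reactions that differ only in compound order.
before = "[CH3:1][OH:2].[Na+:3]>>[CH3:1][O-:2].[Na+:3]"
after = "[Na+:3].[CH3:1][OH:2]>>[Na+:3].[CH3:1][O-:2]"

print(replace_fragment_bonds("[H-]~[Na+].CO>>C[O-]~[Na+]"))
print(same_up_to_compound_order(before, after))
```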