sustainable-processes / ORDerly

Chemical reaction data & benchmarks. Extraction and cleaning of data from Open Reaction Database (ORD)
MIT License

Update atom mapping and reactant detection logic #21

Open dswigh opened 1 year ago

dswigh commented 1 year ago

Atom mapping in the USPTO dataset was done with Indigo more than six years ago (https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873), and better atom-mapping tools have since been developed, e.g. RXNMapper (https://onlinelibrary.wiley.com/doi/10.1002/minf.202100138). Even if RXNMapper is better than Indigo, the benchmarking study linked above may overstate how much better it is, because the benchmarking dataset was specifically curated to include very difficult reactions. Both tools are likely to perform very well on 'easy' reactions, so on a more realistic dataset containing both easy and hard reactions, the difference in mapping performance will likely be smaller.

With better atom mapping, it may also be possible to expand the scope of reactant detection in a reaction string, e.g. by finding atoms in the product that were previously unmapped, looking for those atoms among the agents, and then moving the matching agents to the reactants.
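A rough sketch of that idea, using only the standard library and a regex tokenizer rather than a real cheminformatics toolkit. Everything here is an assumption for illustration: the function names `atoms` and `promote_agents` are hypothetical, the toy reaction is made up, and matching on element symbols alone is a crude heuristic (a solvent that shares an element with an unmapped product atom would be promoted too). This is not how ORDerly currently detects reactants.

```python
import re

# One token per atom: either a bracket atom or an organic-subset atom.
ATOM = re.compile(r"\[([^\]]+)\]|(Br|Cl|[BCNOPSFI]|[bcnops])")
# Inside a bracket: optional isotope, element symbol, then an optional ":map".
BRACKET = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})[^:]*(?::(\d+))?$")


def atoms(smiles):
    """Return (element, map_number_or_None) for each atom token in a SMILES."""
    found = []
    for m in ATOM.finditer(smiles):
        if m.group(1) is not None:
            b = BRACKET.match(m.group(1))
            if b is None:
                continue  # skip anything this simple parser cannot read
            element = b.group(1).capitalize()
            number = int(b.group(2)) if b.group(2) else 0
        else:
            element, number = m.group(2).capitalize(), 0
        # Map number 0 is treated the same as "no map number".
        found.append((element, number or None))
    return found


def promote_agents(rxn_smiles):
    """Move agents that could supply unmapped product atoms to the reactants.

    Expects 'reactants>agents>products'; raises ValueError otherwise.
    """
    reactants, agents, products = rxn_smiles.split(">")
    unmapped = {el for el, num in atoms(products) if num is None}
    kept, moved = [], []
    for agent in filter(None, agents.split(".")):
        elements = {el for el, _ in atoms(agent)}
        (moved if elements & unmapped else kept).append(agent)
    new_reactants = ".".join(filter(None, [reactants] + moved))
    return ">".join([new_reactants, ".".join(kept), products])


# Toy example (hypothetical): the nitrile C and N in the product are unmapped.
rxn = "[CH3:1][CH2:2]Br>[C-]#N.[Na+].O>[CH3:1][CH2:2]C#N"
print(promote_agents(rxn))
```

In the toy reaction the cyanide is moved to the reactants, while the sodium counter-ion and water stay as agents. A real implementation would presumably need a proper SMILES parser (e.g. RDKit) and a stricter test than shared element symbols.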

RXNMapper is quite a heavy programme and would take many hours to run on a few million reactions. Since the gain is likely to be only marginal, and we want to keep ORDerly relatively lightweight, we have decided to keep the original mapping in the ORD dataset (Indigo in the case of USPTO data).

dswigh commented 1 year ago

Here's an example of where the atom mapping fails:

Br[CH2:2][C:3]1[CH:4]=[CH:5][C:6]2[O:15][C:10]3=[N:11][CH:12]=[CH:13][CH:14]=[C:9]3C:8[C:7]=2[CH:17]=1.[CH3:18]N:19C=O.[C-]#N.[Na+]>O>C:18#[N:19]

We would expect the triple-bonded N in the product to come from the triple-bonded N in the reactant ([C-]#N). There is nothing we can do about this; we are at the mercy of the existing atom mapping in ORD.

From: uspto-grants-1976_01.parquet ("ord-cc0d0a952867484fa3eb43ab33c5c8dd"), index 412
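One way to spot this kind of failure is to trace each mapped product atom back to the reactant component that carries the same map number. The sketch below is only illustrative: `map_sources` is a hypothetical helper, and the reaction string is a simplified, made-up analogue of the record above (not the ORD entry itself) in which the nitrile N is likewise mapped to the amide N.

```python
import re

# Capture the map number of any bracket atom, e.g. "[CH3:18]" -> "18".
MAP_NUM = re.compile(r"\[[^\]:]*:(\d+)\]")


def map_sources(rxn_smiles):
    """Map each product atom-map number to the reactant component it came from."""
    reactants, _agents, products = rxn_smiles.split(">")
    owner = {}
    for component in reactants.split("."):
        for num in MAP_NUM.findall(component):
            owner[int(num)] = component
    # None means the map number does not appear in any reactant.
    return {int(n): owner.get(int(n)) for n in MAP_NUM.findall(products)}


# Simplified analogue (hypothetical): atoms 3 and 4 of the nitrile are
# mapped to the amide rather than to the cyanide.
rxn = "Br[CH2:1][CH3:2].[CH3:3][N:4](C)C=O.[C-]#N.[Na+]>O>[CH3:2][CH2:1][C:3]#[N:4]"
for number, source in map_sources(rxn).items():
    print(number, source)
```

Here map numbers 3 and 4 trace back to the amide component, and the cyanide contributes no mapped atoms at all, which is the pattern described above.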