sustainable-processes / ORDerly

Chemical reaction data & benchmarks. Extraction and cleaning of data from Open Reaction Database (ORD)
MIT License
69 stars 8 forks source link

Update logic for removing reactions with unresolvable names #78

Closed dswigh closed 1 year ago

dswigh commented 1 year ago

When adding molecules from rxn.input when Trust_labelling = False

  1. In dealing with reactions with unresolvable names, there should be 4 options (in cleaning): either 'keep', ‘remove_reactions_with_unresolvable_names’, ‘remove_reactions_with_unresolvable_names_only_if_no_rxn_str_exists_otherwise_map_to_none’, or ‘map_unresolvable_names_to_None’ (should maybe come up with a more concise name)
  2. Sometimes, the plain text in rxn.input can refer to molecules that are already present in the reaction. Eg in the case where it says “grignard reagent”, this is superfluous if there’s already a molecule containing Mg and a halogen - so in this case it's safe to just "map_unresolvable_names_to_None", but in other cases perhaps you allow problematic reactions to stay in the dataset. I think "remove_reactions_with_unresolvable_names_only_if_no_rxn_str_exists_otherwise_map_to_none" is probably a reasonable middleground
dswigh commented 1 year ago

1) Implemented 3 bools so we have different options for handling unresolvable names:

set_unresolved_names_to_none_if_mapped_rxn_str_exists_else_del_rxn: bool,

set_unresolved_names_to_none: bool,

remove_rxn_with_unresolved_names: bool,

0 or 1 of these bools can be true, the others must be false. If all of them are false, that means 'do nothing', so there are 4 different ways of handling unresolved names. The default is for "set_unresolved_names_to_none_if_mapped_rxn_str_exists_else_del_rxn" to be True, and the others False. 2) We will not be implementing any manual rules like the one mentioned for “grignard reagent” (this is a can of worms we don't want to open, however, the behaviour described should be automatically captured when "set_unresolved_names_to_none_if_mapped_rxn_str_exists_else_del_rxn" == True