Questions regarding reaction roles in `ord-14091a23403d4d96bdcbd0a64f981f4d`

FanwangM commented 2 years ago

I was using ord-14091a23403d4d96bdcbd0a64f981f4d as my toy example, but I am confused: I am not sure which species are the reactants and which are the products. As we can see from https://open-reaction-database.org/client/id/ord-14091a23403d4d96bdcbd0a64f981f4d#outcomes, the reaction record is very long. The reaction should be a simple Suzuki coupling, as shown in Scheme 1 at https://doi-org.libproxy.mit.edu/10.1039/C9RE00086K.

My understanding is that reactants should be in the inputs message and products should be in the outcomes message. Now, if we suppose rxn_old is my reaction message for ord-14091a23403d4d96bdcbd0a64f981f4d, I can do:

# parsing reactants
# note: reactant_smi is rebuilt on every pass through the loop, so the print
# after the loop only shows the reactants found in the last input
for idx, (reaction_info, reaction_input) in enumerate(rxn_old.inputs.items()):
    # todo (FWMeng): make this block run faster
    # reactant SMILES processing
    reactant_smi = [message_helpers.smiles_from_compound(component)
                    for component in reaction_input.components
                    if component.reaction_role == reaction_old_pb2.ReactionRole.REACTANT]
print(reactant_smi)
# output: []

# parsing products
for idx, reaction_output in enumerate(rxn_old.outcomes):
    product_smi = [message_helpers.smiles_from_compound(product)
                   for product in reaction_output.products
                   # if product.reaction_role == reaction_pb2.ReactionRole.PRODUCT
                   if product.reaction_role == reaction_old_pb2.ReactionRole.PRODUCT
                   ]
print(product_smi)
# output:
# ['C12=CC=CC=C1C3=C(C=CC=C3)N2',
#  'COC(/C(C)=C(C1CC1)/c2cccc(OCc3ccccc3)c2)=O',
#  'COC(/C(C)=C(C1CC1)\\c2cccc(OCc3ccccc3)c2)=O',
#  'NC(C=CC=C1)=C1C2=CC=CC=C2C3=CC(OCC4=CC=CC=C4)=CC=C3',
#  'Oc1cccc(OCc2ccccc2)c1',
#  'c1(COc2cccc(c3cccc(OCc4ccccc4)c3)c2)ccccc1']

# check if reactant in outcomes message
for idx, reaction_output in enumerate(rxn_old.outcomes):
    test_smi = [message_helpers.smiles_from_compound(product)
                for product in reaction_output.products
                if product.reaction_role == reaction_old_pb2.ReactionRole.REACTANT
                ]
print(test_smi)
# output:
# ['OB(O)c1cc(OCc2ccccc2)ccc1', 'COC(=O)/C(C)=C(/OS(=O)(=O)c1ccc(C)cc1)C1CC1']

Was the data from this paper not prepared properly? Even though the authors took measurements at different time points, I think we should have one consistent reaction SMILES. From the above, we can see that no reactants are found in the inputs message, while reactant SMILES can appear in the outcomes message (this is normal, since the reaction takes some time to finish). What is the right way to handle reactions like this? Thank you.

@connorcoley @skearnes

connorcoley commented 2 years ago

In this example, there are many recorded "products" with reaction roles other than PRODUCT. This is because they quantified the peak areas for many species, including leftover unreacted starting material. We would like to capture that analytical data, so those species are listed under the outcomes so we can assign peak areas to them.

From the perspective of cleaning up the data to make it ML-ready, you should consider checking the reaction roles as your snippet shows. If roles are assigned for the species in reaction_output.products, only keep the PRODUCT one(s).
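The filtering described here can be sketched roughly as follows. This is only an illustration under stated assumptions: `Role`, `Species`, and `Outcome` are hypothetical stand-ins for the ORD protobuf messages so the example runs without `ord_schema`, and the role assignments in the toy data are chosen for the example. With real data, the same loops would run over `rxn_old.outcomes` and call `message_helpers.smiles_from_compound` as in the snippet above. The sketch accumulates across all outcomes (one per time point) instead of keeping only the last one, and drops repeats.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Role(Enum):
    # Hypothetical stand-in for the ReactionRole enum in the ORD schema.
    REACTANT = auto()
    PRODUCT = auto()


@dataclass
class Species:
    # Hypothetical stand-in for a product entry with a SMILES and a role.
    smiles: str
    role: Role


@dataclass
class Outcome:
    # Hypothetical stand-in for one outcome (for example, one time point).
    products: list = field(default_factory=list)


def smiles_with_role(outcomes, role):
    """Collect unique SMILES with the given role across all outcomes, in first-seen order."""
    seen = {}
    for outcome in outcomes:
        for species in outcome.products:
            if species.role == role:
                seen.setdefault(species.smiles, None)
    return list(seen)


# Toy data: two time points that both report the same product.
outcomes = [
    Outcome([
        Species("OB(O)c1cc(OCc2ccccc2)ccc1", Role.REACTANT),
        Species("COC(/C(C)=C(C1CC1)/c2cccc(OCc3ccccc3)c2)=O", Role.PRODUCT),
    ]),
    Outcome([
        Species("COC(/C(C)=C(C1CC1)/c2cccc(OCc3ccccc3)c2)=O", Role.PRODUCT),
        Species("Oc1cccc(OCc2ccccc2)c1", Role.PRODUCT),
    ]),
]

products = smiles_with_role(outcomes, Role.PRODUCT)   # keep these for ML
leftover = smiles_with_role(outcomes, Role.REACTANT)  # unreacted starting material
```

Keeping `leftover` separate rather than discarding it leaves the option of using the peak-area data for unreacted starting material later.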

connorcoley commented 2 years ago

It does look like there could be species labeled as reactants in the outcomes that don't appear in the inputs, which would indeed be odd.
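One way to check for this kind of mismatch is a simple membership test, sketched below. The helper name `reactants_missing_from_inputs` is hypothetical, and `input_smiles` is invented for illustration (it is not taken from the record); the two outcome SMILES are the ones printed earlier in the thread. The comparison is on raw strings, so in practice both sides would probably need to be canonicalized first (for instance with a cheminformatics toolkit) before a mismatch means anything.

```python
def reactants_missing_from_inputs(input_smiles, outcome_reactant_smiles):
    """Return outcome species labeled REACTANT whose SMILES is absent from the inputs."""
    known = set(input_smiles)
    return [smi for smi in outcome_reactant_smiles if smi not in known]


# Invented input list for illustration only; not the actual record contents.
input_smiles = ["OB(O)c1cc(OCc2ccccc2)ccc1"]

# Species reported with the REACTANT role in the outcomes (from the output above).
outcome_reactant_smiles = [
    "OB(O)c1cc(OCc2ccccc2)ccc1",
    "COC(=O)/C(C)=C(/OS(=O)(=O)c1ccc(C)cc1)C1CC1",
]

missing = reactants_missing_from_inputs(input_smiles, outcome_reactant_smiles)
```

An empty `missing` list would mean every species labeled as a reactant in the outcomes is also present in the inputs.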

ipendlet commented 1 year ago

This could be a problem with the data, but it doesn't belong in the ord-interface issues. Closing for now.

open-reaction-database / ord-interface

Questions regarding reaction roles in `ord-14091a23403d4d96bdcbd0a64f981f4d` #64