open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International

Follow-up on #183 #193

Open skearnes opened 4 months ago

skearnes commented 4 months ago
> @skearnes thanks for the clarification. I thought it would be that. We will hold on that for now though. @qai222 and I had a discussion at OH today, and there may be some changes we want to make to the dataset before pushing it to main.

I uploaded my processing scripts to this repo. A few notes from the discussion with @bdeadman and the script:

  1. The solvent.volume should have volume_includes_solutes=True specified. This field is unspecified in the current submission.
  2. Convention needed for describing room temperature as a Temperature message.
  3. Convention needed for describing pre-mix, e.g.
    After placing the 96-well plates in ultrasonic water bath for 10 seconds to mix the reaction uniformly, 
    each 96-well plate was covered with an optical glass to minimize the volatilization of solvent and component

    (right now this is in ReactionSetup.environment).

  4. The quantification of products includes a derivatization reaction. One can include it as ReactionWorkups or referencing another Reaction record. We chose the former. This led to ambiguity in assigning reaction_role: A is added as a catalyst of the derivatization reaction, should the role of A be workup or catalyst?
  5. ReactionInput has addition_order specified, but workups is a sequence, so workup1 preceding workup2 implies workup1.input.addition_order <= workup2.input.addition_order. There should be a validation function for this.
  6. Enantiomer products: what should the product SMILES be in the reaction SMILES (one without stereochemistry or two with stereochemistry)? How should one describe the ProductCompound? What I did is
    ProductCompound(
    identifiers=[
        CompoundIdentifier(type="SMILES", value=product_s_smi),
        CompoundIdentifier(type="SMILES", value=product_r_smi),
    ],
    is_desired_product=True,
    ...

Originally posted by @qai222 in https://github.com/open-reaction-database/ord-data/issues/183#issuecomment-2211543575
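
The validation requested in item 5 could look roughly like the sketch below. It uses a hypothetical stand-in class rather than the real ord-schema messages, and it assumes the intended invariant is that addition_order never decreases along the workups sequence; neither the class nor the function name exists in ord-schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WorkupStandIn:
    """Hypothetical stand-in for a workup step whose input carries an addition_order."""
    name: str
    addition_order: Optional[int] = None  # None when the workup has no input


def find_workup_order_violations(workups: List[WorkupStandIn]) -> List[Tuple[str, str]]:
    """Return (earlier, later) name pairs where addition_order decreases.

    Assumption: workups are listed in execution order, so a later workup
    should never have a smaller addition_order than an earlier one.
    """
    violations = []
    previous = None
    for workup in workups:
        if workup.addition_order is None:
            continue  # nothing to compare for workups without an input
        if previous is not None and previous.addition_order > workup.addition_order:
            violations.append((previous.name, workup.name))
        previous = workup
    return violations


ok = [WorkupStandIn("quench", 1), WorkupStandIn("derivatize", 2)]
bad = [WorkupStandIn("quench", 2), WorkupStandIn("derivatize", 1)]
print(find_workup_order_violations(ok))
print(find_workup_order_violations(bad))
```

Workups without an input are skipped rather than treated as errors, since not every workup step adds material.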

bdeadman commented 4 months ago

data-1721320121427.csv

This list of 27 reactions is everything in the ORD with an 'EE' selectivity type. I searched for 'ER' as well but the query returned no hits.

qai222 commented 4 months ago

> data-1721320121427.csv
>
> This list of 27 reactions is everything in the ORD with an 'EE' selectivity type. I searched for 'ER' as well but the query returned no hits.

Looks like all of them just have one configuration as the product... I guess it might be OK when it is obvious that there is only one other configuration. I think we can create a convention so that the structures of all configurations are included.

bdeadman commented 3 months ago

Possible styles for reporting chiral products:

Option A - Major enantiomer only with an enantioselectivity measurement associated with the single enantiomer product. The product description only has the major enantiomer.

example 1 image

example 2 image

example 3 image

Option B - Both enantiomers as separate products. Measurements of individual enantiomers (e.g. chiral peak area) associated with a single product. Selectivity defined at the outcome analysis level (rather than product measurement), with ee also recorded as a product measurement of the major enantiomer (but not the alternative enantiomer).

enantioselective reaction option B example.pbtxt.gz

This is an example from a dataset being prepared by @bdeadman based on https://doi.org/10.1021/acscatal.3c02859.
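
When Option B records a chiral peak area for each enantiomer, the ee can be derived from the two areas. A minimal sketch, assuming the areas are directly proportional to the amounts of each enantiomer; the function name and the example areas are illustrative and not taken from the dataset.

```python
def enantiomeric_excess(area_a: float, area_b: float) -> float:
    """Return ee in percent from the chiral peak areas of the two enantiomers."""
    total = area_a + area_b
    if area_a < 0 or area_b < 0 or total == 0:
        raise ValueError("peak areas must be non-negative and not both zero")
    # ee is the absolute difference between the enantiomers over their sum
    return 100.0 * abs(area_a - area_b) / total


# Illustrative areas only: 95 for the major enantiomer, 5 for the minor one
ee = enantiomeric_excess(95.0, 5.0)
print(ee)
```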

bdeadman commented 3 months ago

Both options have their merits. Option A is consistent with how such reactions are typically reported in the literature. This is also how the 27 existing chiral reactions in the ORD have been recorded.

In option B the alternative enantiomer is explicit instead of implied. This option also works well if there is measurement data for the individual enantiomers (e.g. chiral peak area) instead of just a ratio or excess. For simple product mixtures the enantioselectivity reported at the outcome measurement level is unambiguous, but if the product mixture is complex (contains more than just the enantiomeric pair) then this could be ambiguous.

If I recall correctly, option B is similar to how the Organic Reactions text book reports enantioselective reactions. They also prefer to use enantiomeric ratio over enantiomeric excess.
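
For records that only hold an ee value, the corresponding enantiomeric ratio can be recovered arithmetically. A small sketch, assuming the ratio is expressed as major:minor percentages that sum to 100; the function name is illustrative.

```python
from typing import Tuple


def ee_to_er(ee_percent: float) -> Tuple[float, float]:
    """Convert ee (percent) to an enantiomeric ratio (major %, minor %)."""
    if not 0.0 <= ee_percent <= 100.0:
        raise ValueError("ee must be between 0 and 100 percent")
    # major + minor = 100 and major - minor = ee
    major = (100.0 + ee_percent) / 2.0
    minor = (100.0 - ee_percent) / 2.0
    return major, minor


er = ee_to_er(90.0)
print(er)  # (95.0, 5.0)
```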

bdeadman commented 3 months ago

> > data-1721320121427.csv
> >
> > This list of 27 reactions is everything in the ORD with an 'EE' selectivity type. I searched for 'ER' as well but the query returned no hits.
>
> Looks like all of them just have one configuration as the product... I guess it might be OK when it is obvious there is only another configuration. I think we can create a convention so the structures of all configurations are included.

Agreed but I will discuss it at office hour tonight.

bdeadman commented 3 months ago

Room Temperature Convention

ambient temperatures.xlsx

Best practice is for the experimenter to report their local room temperature with a precision value to indicate the likely range experienced by the reaction. For example, this dataset from Novartis reports 22 +- 1, which probably reflects an air-conditioned lab.

In situations where the local room temperature is not known (e.g. data from a paper) then an estimate is needed. The USPTO dataset does not appear to have a consistent way of representing ambient temperature.

My suggestion is to treat an unreported room temperature as 25 +- 5 or 10 degrees Celsius to cover the likely range. Alternatively, the precision could be artificially high (e.g. 99) to indicate that the value is assumed. I would also set the temperature control type to AMBIENT, and include a details field stating "room temperature value is assumed to be 25 C".

Example message for ambiguous room temperature:

conditions {
  temperature {
    control {
      type: AMBIENT
      details: "room temperature value is assumed to be 25 C"
    }
    setpoint {
      value: 25.0
      precision: 99.0
      units: CELSIUS
    }
  }
}

bdeadman commented 3 months ago

Reporting the pre-mix

I would include it as a CUSTOM stirring condition:

conditions {
  stirring {
    type: CUSTOM
    details: "After placing the 96-well plates in ultrasonic water bath for 10 seconds to mix the reaction uniformly, each 96-well plate was covered with an optical glass to minimize the volatilization of solvent and component ..."
  }
}

Reporting the atmosphere

I suggest moving the vessel purge step from the vessel preparation to conditions:

conditions {
  pressure {
    control {
      type: AMBIENT
      details: "A transparent acrylic top layer was fixed to the container with 12 flange bolts, then the container was degassed with an oil pump for three times and refilled with nitrogen and the container was connected with a nitrogen balloon to ensure a nitrogen atmosphere for the reaction."
    }
    setpoint {
      value: 1.0
      units: ATMOSPHERE
    }
    atmosphere {
      type: NITROGEN
    }
  }
}

bdeadman commented 3 months ago

Reporting of the Derivatization Results

Summarising the discussions we have had with Connor:

bdeadman commented 3 months ago

Workup Order

Added as a feature request: open-reaction-database/ord-schema#743

Reporting enantiomeric products in product SMILES

I suggest including the main enantiomer only and/or reporting them as two separate structures. In hindsight, I think including both enantiomer SMILES in a single product creates ambiguity about which one it is.

bdeadman commented 3 months ago

> My suggestion is to treat an unreported room temperature as 25 +- 5 or 10 degrees Celsius to cover the likely range.

Following a discussion with @connorcoley, @skearnes and @qai222 on 24th July 2024, we have agreed to use 22 +- 5 degrees Celsius for reporting ambiguous room temperature where local knowledge is not available.
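
The agreed convention could be captured in a small helper so that curation scripts apply it consistently. This is only a sketch: it uses a plain dataclass rather than the ord-schema messages, the class and function names are hypothetical, and the wording of the details string is an assumption adapted from the earlier 25 C example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssumedTemperature:
    """Hypothetical container mirroring the temperature fields used in this thread."""
    value: float
    precision: float
    units: str
    control_type: str
    details: str


def assumed_room_temperature() -> AssumedTemperature:
    """Return the agreed default for an unreported room temperature: 22 +- 5 C, AMBIENT."""
    return AssumedTemperature(
        value=22.0,
        precision=5.0,
        units="CELSIUS",
        control_type="AMBIENT",
        # Assumed wording, adapted from the 25 C example earlier in the thread
        details="room temperature value is assumed to be 22 C",
    )


def to_pbtxt(t: AssumedTemperature) -> str:
    """Render the temperature as protobuf text format nested under conditions."""
    return (
        "conditions {\n"
        "  temperature {\n"
        "    control {\n"
        f"      type: {t.control_type}\n"
        f'      details: "{t.details}"\n'
        "    }\n"
        "    setpoint {\n"
        f"      value: {t.value}\n"
        f"      precision: {t.precision}\n"
        f"      units: {t.units}\n"
        "    }\n"
        "  }\n"
        "}"
    )


room_temp = assumed_room_temperature()
print(to_pbtxt(room_temp))
```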

bdeadman commented 3 months ago

Reporting of the Derivatization Results

From above meeting we also determined that reactions with product characterisation by derivative can be reported as: