open-reaction-database / ord-schema

Schema for the Open Reaction Database
https://open-reaction-database.org
Apache License 2.0

Add field in schema and web editor for "relative yield only" #432

Closed connorcoley closed 3 years ago

connorcoley commented 4 years ago

In some cases, yields should only be treated relative to other yields within the same dataset, e.g., the Santanilla HTE example. I believe an optional boolean field yield_is_relative or something similar is needed. This is related to issue #420 but distinct enough to warrant its own discussion.

(CC @michaelmaser @skearnes @beef-broccoli)
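A minimal sketch of how such a flag might be consumed downstream. `YieldRecord`, `dataset_id`, and `comparable` are hypothetical names for illustration only and are not part of the ORD schema; the rule that relative and quantitative yields should never be mixed is an assumption here, not a settled design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class YieldRecord:
    """Hypothetical stand-in for a reported yield; not the ORD schema."""
    dataset_id: str
    value: float  # percent
    yield_is_relative: Optional[bool] = None  # None means "not specified"


def comparable(a: YieldRecord, b: YieldRecord) -> bool:
    """Return True if two yields may be compared directly."""
    a_rel = bool(a.yield_is_relative)
    b_rel = bool(b.yield_is_relative)
    if a_rel != b_rel:
        # Assumed rule: do not mix relative and quantitative yields.
        return False
    if a_rel:
        # Relative yields are only meaningful within their own dataset.
        return a.dataset_id == b.dataset_id
    return True
```

An unset flag is treated the same as `False` in this sketch, which keeps the field optional as proposed.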

michaelmaser commented 4 years ago

In general, I agree this is a necessary field. Hopefully it will prevent practitioners from mixing quantitative and relative data.

That said, I'm not actually sure this even applies to the Santanilla example... They compare raw product/internal_standard peak area ratios for relative "performance", but each reaction makes a different product. Since they didn't calibrate for each product's absorption response, I don't think they can really even say yield_is_relative except when two reactions make the same product.

I don't know if an additional yield_is_qualitative is worth it, or if these examples should just be discouraged from reporting numerical yields in outcomes. The raw/processed data can still be included, but I'm not sure how useful it even is... I suppose the qualitative observation that the reaction formed any amount of product could be valuable.

beef-broccoli commented 4 years ago

A very easy fix for this particular paper is to ask the authors for the amount of internal standard used, or even just a ratio with regard to the amount of substrate. I tried to figure it out based on the various product/standard ratios they provided (most of the time people use 1 equivalent of internal standard, or some integer number), but it seems like each plate has a different scale. I have no idea why they chose to report this relative conversion rather than calculate an actual yield. They have to know how much internal standard they put in, and it obviously has to be consistent across the plate, otherwise the comparison is not valid.

For the purpose of the database, I think this type of result should be discouraged unless there is a very good reason for it (i.e., why obtaining a yield is not possible). For this particular paper, if we can't come up with a good solution, maybe we should just leave it out. It's still good for the purpose of testing the schema, but it's probably not worth the time trying to figure out these yield numbers.

michaelmaser commented 4 years ago

The authors write in the SI: "Using the Mosquito, the plate was then quenched with 3 uL of a DMSO stock solution of 5% acetic acid and biphenyl (to give 3 mol% biphenyl relative to 22), which was transferred from a 384-well source plate. ... The 384-well plate was then heat-sealed and subjected to chromatographic analysis by a Waters UPLC Instrument. The ratio of the LC area counts of product over internal standard was used to directly compare the relative performance of these reactions."

So, they are assuming 0.03 equiv internal standard (IS) to the limiting substrate. I multiplied their product/IS ratios by 0.03 to get a yield by: data['Product_yield'] = round(((data['Prod']/(data['IS']/0.03))*100), 2), and some still come out to be around 120%.
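A minimal standalone sketch of the same arithmetic without pandas, assuming 0.03 equiv of internal standard and an equal detector response for product and IS. The function name and the peak areas below are hypothetical, chosen only to show how a value above 100% can arise.

```python
IS_EQUIV = 0.03  # 3 mol% biphenyl relative to the limiting substrate (from the SI)


def apparent_yield(prod_area: float, is_area: float,
                   is_equiv: float = IS_EQUIV) -> float:
    """Apparent percent yield, assuming product and IS respond equally."""
    if is_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    # Same form as the pandas expression: Prod / (IS / 0.03) * 100
    return round(prod_area / (is_area / is_equiv) * 100, 2)


# Hypothetical peak areas: a product/IS area ratio of 40 gives 120%.
print(apparent_yield(prod_area=40.0, is_area=1.0))
```

Any product/IS area ratio above roughly 33 exceeds 100% under these assumptions, which suggests the equal-response assumption is what fails.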

I think the problem isn't that they don't have an IS, it's that they can't make a 1:1 comparison between product & IS peak areas because they're measuring by "LC area count", which I'm assuming means either UV absorption or just LCMS peaks from ionization. If UV, different molecules will have different absorption responses at the given wavelength they're using, even at the same concentration (from Beer's law). If MS, different molecules will ionize to different extents under the given ionization method used, even at the same concentration. So, they need to measure these response factors in order to draw quantitative yields.
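A rough sketch of the correction a measured response factor would allow. It assumes a single-point calibration in which product and IS are injected at known concentrations; the function names and all numbers are hypothetical, and nothing here comes from the paper.

```python
def response_factor(prod_area: float, prod_conc: float,
                    is_area: float, is_conc: float) -> float:
    """Relative response of product vs. IS from a calibration injection."""
    return (prod_area / prod_conc) / (is_area / is_conc)


def calibrated_yield(prod_area: float, is_area: float, rf: float,
                     is_equiv: float = 0.03) -> float:
    """Percent yield after dividing the area ratio by the response factor."""
    mol_ratio = (prod_area / is_area) / rf  # mol product per mol IS
    return mol_ratio * is_equiv * 100


# Hypothetical calibration: at equal concentration the product peak is
# twice the IS peak, so rf = 2 and an apparent 120% becomes about 60%.
rf = response_factor(prod_area=200.0, prod_conc=1.0, is_area=100.0, is_conc=1.0)
print(calibrated_yield(prod_area=40.0, is_area=1.0, rf=rf))
```

Because each well makes a different product, a separate response factor would be needed for every product before the ratios become quantitative.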

It is possible to compare product/IS ratios for "relative yield" between two reactions, but only if the product is the same molecule or if the ratios come from NMR or some other universal detection method. They're comparing between reactions that make different products, so I'm not sure yield_is_relative is appropriate here.

For this example ORD submission, I'm thinking only the raw peak areas should be reported with the reaction outcomes. Thoughts?

connorcoley commented 4 years ago

Discussion w/ @skearnes and @michaelmaser

While making this change, we can revisit the question of whether it makes sense to include the whole Compound compound in a product or not. We only really use the identifiers and amount fields right now. The amount itself could be the value associated with a WEIGHT analysis in a ProductMetric field, but this wouldn't have associated units. On the whole, keeping Compound compound in the ReactionProduct seems to make sense. However, we should write new validation checks to ensure that extraneous Compound fields are not defined for product compounds.
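One possible shape for such a check, sketched with a plain dataclass rather than the real protobuf messages. The `Compound` class, its field list, and `extraneous_product_fields` are simplified, hypothetical stand-ins; only identifiers and amount are taken from the discussion as the fields a product uses.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

# Fields a product compound is expected to use, per the discussion above.
ALLOWED_PRODUCT_FIELDS = {"identifiers", "amount"}


@dataclass
class Compound:
    """Hypothetical, simplified stand-in for the schema's Compound message."""
    identifiers: List[str] = field(default_factory=list)
    amount: Optional[float] = None
    reaction_role: Optional[str] = None
    preparations: List[str] = field(default_factory=list)


def extraneous_product_fields(compound: Compound) -> List[str]:
    """Names of populated fields that a product compound should not define."""
    return [
        f.name
        for f in fields(compound)
        if f.name not in ALLOWED_PRODUCT_FIELDS and getattr(compound, f.name)
    ]
```

A validator could raise a warning or an error whenever this list is non-empty for a compound attached to a ReactionProduct.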

The features field in a Compound should be broadened to be a map to a Data field, not just string and float values (#478).