Add field in schema and web editor for "relative yield only"

connorcoley commented 4 years ago

In some cases, yields should only be treated relative to other yields within the same dataset, e.g., the Santanilla HTE example. I believe an optional boolean field yield_is_relative or something similar is needed. This is related to issue #420 but distinct enough to warrant its own discussion.

(CC @michaelmaser @skearnes @beef-broccoli)

michaelmaser commented 4 years ago

In general, I agree this is a necessary field. Hopefully it will prevent practitioners from mixing quantitative and relative data.

That said, I'm not actually sure this even applies to the Santanilla example... They compare raw product/internal_standard peak area ratios for relative "performance", but each reaction makes a different product. Since they didn't calibrate for each product's absorption response, I don't think they can really even say yield_is_relative except when two reactions make the same product.

I don't know if an additional yield_is_qualitative is worth it, or if these examples should just be discouraged from reporting numerical yields in outcomes. The raw/processed data can still be included, but I'm not sure how useful it even is... I suppose the qualitative observation that the reaction formed any amount of product could be valuable.

beef-broccoli commented 4 years ago

A very easy fix for this particular paper is to ask the authors for the amount of internal standard used, or even just its ratio relative to the amount of substrate. I tried to figure it out based on the various product/standard ratios they provided (most of the time people use 1 equivalent of internal standard, or some integer number), but it seems like there's a different scale for each plate. I have no idea why they chose to report this relative conversion thing rather than calculate an actual yield. They have to know how much internal standard they put in, and it obviously has to be consistent across the plate; otherwise the comparison is not valid.

For the purpose of the database, I think this type of result should be discouraged unless there is a very good reason for it (e.g., obtaining a yield is not possible). For this particular paper, if we can't come up with a good solution, maybe just leave it out. It's still good for the purpose of testing the schema, but it's probably not worth the time trying to figure out these yield numbers.

michaelmaser commented 4 years ago

The authors write in the SI: "Using the Mosquito, the plate was then quenched with 3 uL of a DMSO stock solution of 5% acetic acid and biphenyl (to give 3 mol% biphenyl relative to 22), which was transferred from a 384-well source plate. ... The 384-well plate was then heat-sealed and subjected to chromatographic analysis by a Waters UPLC Instrument. The ratio of the LC area counts of product over internal standard was used to directly compare the relative performance of these reactions."

So, they are assuming 0.03 equiv internal standard (IS) to the limiting substrate. I multiplied their product/IS ratios by 0.03 to get a yield by: data['Product_yield'] = round(((data['Prod']/(data['IS']/0.03))*100), 2), and some still come out to be around 120%.
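
For reference, the same arithmetic can be sketched in plain Python. This is only an illustration: the function name apparent_yield_percent and the peak areas are hypothetical (not taken from the paper), and it assumes that product and biphenyl give the same detector response per mole, which has not been established for this dataset.

```python
def apparent_yield_percent(prod_area, is_area, is_equiv=0.03):
    """Apparent yield (%) from product and internal-standard (IS) peak areas.

    Assumes product and IS have identical detector response per mole
    (an unverified assumption for the data discussed in this thread).
    is_equiv is the amount of IS relative to the limiting substrate.
    """
    if is_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    # Same expression as the pandas one-liner: Prod / (IS / 0.03) * 100
    return round(prod_area / (is_area / is_equiv) * 100, 2)


# Hypothetical peak areas, for illustration only.
for prod_area in (2000.0, 4000.0):
    print(prod_area, apparent_yield_percent(prod_area, is_area=100.0))
```

With a product/IS area ratio of 40, the apparent yield is already above 100%, which is the kind of value that signals the 1:1 response assumption does not hold.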

I think the problem isn't that they don't have an IS, it's that they can't make a 1:1 comparison between product & IS peak areas because they're measuring by "LC area count", which I'm assuming means either UV absorption or just LCMS peaks from ionization. If UV, different molecules will have different absorption responses at the given wavelength they're using, even at the same concentration (from Beer's law). If MS, different molecules will ionize to different extents under the given ionization method used, even at the same concentration. So, they need to measure these response factors in order to draw quantitative yields.
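
To make the role of the response factor concrete, here is a rough sketch of the usual internal-standard correction. Everything named here is an assumption for illustration: the function calibrated_yield_percent, the relative response factors, and the peak areas are hypothetical, and a real factor would have to be measured for each product against an authentic sample.

```python
def calibrated_yield_percent(prod_area, is_area, is_equiv, rel_response):
    """Yield (%) corrected by a relative response factor.

    rel_response = (product area per mole) / (IS area per mole), measured
    under the same detection conditions. A value of 1.0 reproduces the
    uncalibrated 1:1 comparison of peak areas.
    """
    if is_area <= 0 or rel_response <= 0:
        raise ValueError("is_area and rel_response must be positive")
    mole_ratio = (prod_area / is_area) / rel_response  # mol product per mol IS
    return mole_ratio * is_equiv * 100


# One hypothetical raw measurement, interpreted with two hypothetical factors.
raw = dict(prod_area=4000.0, is_area=100.0, is_equiv=0.03)
print(calibrated_yield_percent(**raw, rel_response=1.0))  # no calibration
print(calibrated_yield_percent(**raw, rel_response=2.0))  # product responds 2x as strongly as IS
```

The same raw ratio gives a yield half as large once the product is assumed to respond twice as strongly as the standard, so ratios for different products cannot be ranked against each other without these factors.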

It is possible to compare product/IS ratios for "relative yield" between two reactions, but only if the product is the same molecule or if the ratios come from NMR or some other universal detection method. They're comparing between reactions that make different products, so I'm not sure yield_is_relative is appropriate here.

For this example ORD submission, I'm thinking only the raw peak areas should be reported with the reaction outcomes. Thoughts?

connorcoley commented 4 years ago

Discussion w/ @skearnes and @michaelmaser

Yields that are not well-quantified yields should never be in the yield field if we have that as a separate field.
We should be able to accommodate many normalized and unnormalized abundances for each species quantified in an analysis. We should be able to handle cases where yields are relative only. (this issue #432)
One option is to have a more generic message (e.g., repeated ProductMetric (name not great -- Characterization?)) that has an enum to select between yield, selectivity, purity, EIC counts, UV peak area, etc., and all of the other numbers one uses to describe a particular peak/species in an analysis. Each of these ProductMetrics would cross-reference a single analysis key. It would have a details field, too. All of these values can be written as floats. We need to be clear about how percentages are recorded so we don't mix up 0.5 and 50%. This allows product-level analytical data (#426).
Selectivity doesn't fit as nicely into the pattern of having a single float descriptor. For defining things like ER, DR, EZ, ZE, etc., it might be easiest to have a back-up string_value field.
Each ProductMetric will have a boolean authentic_standard and internal_standard. Maybe a compound field as well for authentic_standard?
Product-specific analytical data (e.g., HRMS) also belongs in this new field, rather than the processed_data map of a ReactionAnalysis.
Color/texture also potentially belong in this new characterization field.
To accommodate values with units (e.g., melting point temperature, CD angle), we should have additional fields that will only be defined under special circumstances. For example, a Temperature temperature field that should only be used when the analysis_key points to a ReactionAnalysis with type MP. This will require significant additions to the validation functions.
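
As a rough illustration of the ProductMetric idea in the notes above, the shape could look something like the following Python dataclass. This is not the schema itself (which would be a protocol buffer message) and nothing here is final: the enum members mirror the examples listed above, while the names MetricType, float_value, and validate, the analysis keys, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MetricType(Enum):
    # Examples named in the discussion; the real enum could differ.
    YIELD = "yield"
    SELECTIVITY = "selectivity"
    PURITY = "purity"
    EIC_COUNTS = "eic_counts"
    UV_PEAK_AREA = "uv_peak_area"


@dataclass
class ProductMetric:
    type: MetricType
    analysis_key: str                   # cross-references a single analysis
    float_value: Optional[float] = None
    string_value: Optional[str] = None  # back-up for values like ER/DR ratios
    details: str = ""
    authentic_standard: bool = False
    internal_standard: bool = False

    def validate(self):
        """Minimal hypothetical check: a metric must carry some value."""
        if self.float_value is None and self.string_value is None:
            raise ValueError("a metric needs a float_value or a string_value")
        if self.float_value is not None and self.float_value < 0:
            raise ValueError("float_value must be non-negative")


# Hypothetical metrics for one product: a raw peak area and a selectivity.
metrics = [
    ProductMetric(MetricType.UV_PEAK_AREA, analysis_key="uplc",
                  float_value=4000.0, internal_standard=True),
    ProductMetric(MetricType.SELECTIVITY, analysis_key="chiral_hplc",
                  string_value="95:5 er"),
]
for metric in metrics:
    metric.validate()
```

Keeping raw peak areas under their own type, separate from YIELD, is one way to keep unquantified values out of the yield field. The percentage convention (0.5 versus 50) is deliberately left out of this sketch because it has not been decided.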

While making this change, we can revisit the question of whether it makes sense to include the whole Compound compound in a product or not. We only really use the identifiers and amount fields right now. The amount itself could be the value associated with a WEIGHT analysis in a ProductMetric field, but this wouldn't have associated units. On the whole, keeping Compound compound in the ReactionProduct seems to make sense. However, we should write new validation checks to ensure that extraneous Compound fields are not defined for product compounds.

The features field in a Compound should be broadened to be a map to a Data field, not just string and float values (#478).

open-reaction-database / ord-schema
