Closed connorcoley closed 3 years ago
In general, I agree this is a necessary field. Hopefully it will prevent practitioners from mixing quantitative and relative data.
That said, I'm not actually sure this even applies to the Santanilla example... They compare raw product/internal_standard peak area ratios for relative "performance", but each reaction makes a different product. Since they didn't calibrate for each product's absorption response, I don't think they can really even say `yield_is_relative` except when two reactions make the same product.
I don't know if an additional `yield_is_qualitative` is worth it, or if these examples should just be discouraged from reporting numerical yields in outcomes. The raw/processed data can still be included, but I'm not sure how useful it even is... I suppose the qualitative observation that the reaction formed any amount of product could be valuable.
A very easy fix for this particular paper is to ask the authors for the amount of internal standard used, or even just its ratio to the amount of substrate. I tried to figure it out based on the various product/standard ratios they provided (most of the time people use 1 equivalent of internal standard, or some integer number), but it seems like each plate uses a different scale. I have no idea why they chose to report this relative conversion rather than calculate an actual yield. They have to know how much internal standard they put in, and it obviously has to be consistent across the plate; otherwise the comparison is not valid.
For the purposes of the database, I think this type of result should be discouraged unless there is a very good reason for it (i.e., why obtaining a yield is not possible). For this particular paper, if we can't come up with a good solution, maybe we should just leave it out. It's still good for testing the schema, but it's probably not worth the time trying to figure out these yield numbers.
The authors write in the SI: "Using the Mosquito, the plate was then quenched with 3 uL of a DMSO stock solution of acetic 5% acid and biphenyl (to give 3 mol% biphenyl relative to 22), which was transferred from a 384-well source plate. ... The 384-well plate was then heat-sealed and subjected to chromatographic analysis by a Waters UPLC Instrument. The ratio of the LC area counts of product over internal standard was used to directly compare the relative performance of these reactions."
So, they are assuming 0.03 equiv of internal standard (IS) relative to the limiting substrate. I multiplied their product/IS ratios by 0.03 to get a yield via `data['Product_yield'] = round(((data['Prod']/(data['IS']/0.03))*100), 2)`, and some still come out to be around 120%.
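To make the arithmetic concrete, here is a minimal pure-Python sketch of the same calculation. The function name and the peak areas are hypothetical (they are not from the paper); the only value taken from the SI is the 3 mol% internal standard loading. It assumes product and IS give identical detector response per mole.

```python
# Assumed from the SI quote: 3 mol% biphenyl internal standard (IS)
# relative to the limiting substrate.
IS_EQUIV = 0.03


def naive_yield_percent(prod_area, is_area, is_equiv=IS_EQUIV):
    """Yield (%) assuming product and IS have the same response per mole."""
    if is_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    # Same expression as the pandas one-liner, applied to a single well.
    return round(prod_area / (is_area / is_equiv) * 100, 2)


# Hypothetical peak areas, chosen only to illustrate the scaling.
print(naive_yield_percent(prod_area=20.0, is_area=1.0))  # 60.0
print(naive_yield_percent(prod_area=40.0, is_area=1.0))  # 120.0
```

Any product/IS area ratio above about 33 gives a "yield" over 100% under this assumption, which is the kind of number that shows up for some wells.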
I think the problem isn't that they don't have an IS, it's that they can't make a 1:1 comparison between product and IS peak areas because they're measuring by "LC area count", which I'm assuming means either UV absorption or just LCMS peaks from ionization. If UV, different molecules will have different absorption responses at the wavelength being used, even at the same concentration (from Beer's law). If MS, different molecules will ionize to different extents under the ionization method being used, even at the same concentration. So, they need to measure these response factors in order to report quantitative yields.
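As a rough illustration of what measuring a response factor would buy us, here is a sketch of a single-point relative response factor (RRF) correction. All names and numbers are hypothetical, and it assumes a linear detector response and a calibration run with known concentrations of authentic product and IS, which the paper does not report.

```python
def relative_response_factor(area_prod, conc_prod, area_is, conc_is):
    """RRF = (product area per unit concentration) / (IS area per unit concentration)."""
    return (area_prod / conc_prod) / (area_is / conc_is)


def corrected_yield_percent(prod_area, is_area, rrf, is_equiv=0.03):
    """Yield (%) after converting the area ratio into a molar ratio via the RRF."""
    molar_ratio = (prod_area / is_area) / rrf  # mol product per mol IS
    return molar_ratio * is_equiv * 100


# Hypothetical calibration: at equal concentration the product peak is
# twice as large as the IS peak, so RRF = 2.0.
rrf = relative_response_factor(area_prod=2.0, conc_prod=1.0,
                               area_is=1.0, conc_is=1.0)

# The same raw ratio of 40 that naively reads as 120% becomes about 60%.
print(round(corrected_yield_percent(prod_area=40.0, is_area=1.0, rrf=rrf), 2))
```

Since each reaction here makes a different product, each product would need its own RRF, which is why the raw ratios alone can't be turned into yields.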
It is possible to compare product/IS ratios for "relative yield" between two reactions, but only if the product is the same molecule or if the ratios come from NMR or some other universal detection method. They're comparing between reactions that make different products, so I'm not sure `yield_is_relative` is appropriate here.
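The rule above could be written down as a simple check. This is only a sketch: the function name, the detector labels, and the use of plain strings (e.g. SMILES) to identify products are assumptions for illustration, not anything in the ORD schema.

```python
# Assumption: NMR is the only detector treated as "universal" here;
# extend this set if other methods qualify.
UNIVERSAL_DETECTORS = {"NMR"}


def ratios_comparable(product_a, product_b, detector):
    """True if product/IS ratios from two reactions can be compared directly.

    Comparable when both reactions make the same product, or when the
    detector responds equally to different molecules.
    """
    same_product = product_a == product_b
    return same_product or detector.upper() in UNIVERSAL_DETECTORS


print(ratios_comparable("c1ccccc1N", "c1ccccc1N", "UV"))   # True
print(ratios_comparable("c1ccccc1N", "c1ccccc1O", "UV"))   # False
print(ratios_comparable("c1ccccc1N", "c1ccccc1O", "nmr"))  # True
```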
For this example ORD submission, I'm thinking only the raw peak areas should be reported with the reaction outcomes. Thoughts?
Discussion w/ @skearnes and @michaelmaser
- `yield` field if we have that as a separate field.
- `repeated ProductMetric` (name not great -- Characterization?) that has an enum to select between yield, selectivity, purity, EIC counts, UV peak area, etc., and all of the other numbers one uses to describe a particular peak/species in an analysis. Each of these ProductMetrics would cross-reference a single analysis key. It would have a details field, too. All of these values can be written as floats. We need to be clear about how percentages are recorded so we don't mix up 0.5 and 50%. This allows product-level analytical data (#426)
- `string_value` field.
- `compound` field as well for authentic_standard?
- `Temperature temperature` field that should only be used when the `analysis_key` points to a ReactionAnalysis with type MP. This will require significant additions to the validation functions.
- While doing this change, we can revisit the question of whether it makes sense to include the whole `Compound compound` in a product or not. We only really use the identifiers and amount fields right now. The amount itself could be the value associated with a WEIGHT analysis in a ProductMetric field, but this wouldn't have associated units. On the whole, keeping `Compound compound` in the ReactionProduct seems to make sense. However, we should write new validation checks to ensure that extraneous Compound fields are not defined for product compounds.
- The `features` field in a Compound should be broadened to be a map to a Data field, not just string and float values (#478)
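To help picture the ProductMetric idea and the 0.5 vs 50% concern, here is a rough Python sketch. It is not the actual schema or validation code: the class, enum members, and `validate_metric` helper are hypothetical, and it assumes a convention that percentage-type metrics are stored on a 0 to 100 scale.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class MetricType(Enum):
    YIELD = "yield"
    SELECTIVITY = "selectivity"
    PURITY = "purity"
    EIC_COUNTS = "eic_counts"
    UV_PEAK_AREA = "uv_peak_area"


# Assumption: only these types are treated as percentages (0 to 100).
PERCENT_TYPES = {MetricType.YIELD, MetricType.PURITY}


@dataclass
class ProductMetric:
    type: MetricType
    value: float
    analysis_key: str  # cross-references a single analysis
    details: str = ""


def validate_metric(metric: ProductMetric, analyses: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if metric.analysis_key not in analyses:
        problems.append(f"unknown analysis_key {metric.analysis_key!r}")
    if metric.type in PERCENT_TYPES:
        if metric.value < 0:
            problems.append("percentage is negative")
        elif 0 < metric.value <= 1:
            # Catches 0.5 recorded where 50 (%) was probably meant.
            problems.append("value <= 1: possibly a fraction, not a percentage")
    return problems


analyses = {"lc_1": "LC"}  # hypothetical analysis map: key -> analysis type
print(validate_metric(ProductMetric(MetricType.YIELD, 0.5, "lc_1"), analyses))
print(validate_metric(ProductMetric(MetricType.UV_PEAK_AREA, 0.5, "lc_1"), analyses))
```

The first call flags the suspicious 0.5 yield; the second returns an empty list because a raw peak area has no percentage convention to violate.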
In some cases, yields should only be treated relative to other yields within the same dataset, e.g., the Santanilla HTE example. I believe an optional boolean field `yield_is_relative` or something similar is needed. This is related to issue #420 but distinct enough to warrant its own discussion. (CC @michaelmaser @skearnes @beef-broccoli)