open-reaction-database / ord-schema

Schema for the Open Reaction Database
https://open-reaction-database.org
Apache License 2.0
95 stars 27 forks source link

Allow arbitrary compound data #478

Closed connorcoley closed 4 years ago

connorcoley commented 4 years ago

Resolves #466

In addition to allowing arbitrary data to be defined for products (as suggested in issue #466), I thought it would be useful to also allow arbitrary data for compounds. I could imagine attaching the results of some QC to a starting material. For me, the most natural place to add a map<string, Data> field is to the Compound message so it can do double duty for starting materials and product compounds.

codecov[bot] commented 4 years ago

Codecov Report

Merging #478 into main will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #478   +/-   ##
=======================================
  Coverage   75.94%   75.94%           
=======================================
  Files          22       22           
  Lines        2029     2029           
  Branches      459      459           
=======================================
  Hits         1541     1541           
  Misses        347      347           
  Partials      141      141           
connorcoley commented 4 years ago

I did think about that. My inclination is that it would be worse to have a Data map for Compounds and a Data map for ReactionProducts. I'm not strictly opposed to having it in both places. If we do that, then we're basically using Compound product solely for its repeated CompoundIdentifier identifiers and amount fields?

Re: compounds, a separate idea is that we could potentially remove Features from Compounds now since those could be considered a subset of Data.

michaelmaser commented 4 years ago

Is product-level data not already in multiple places? The ReactionProduct has Percentage yield, its Compound has amount and identifiers, and the outcomes.analyses has processed_data.

Think I added this on #466, but I think it would be nice to have Data as part of ReactionProduct. Not sure what that should do to the other messages.

skearnes commented 4 years ago

This raises some interesting points. What's your reaction to the following?

Essentially this makes Compound and ReactionProduct counterparts for inputs and outputs, respectively.

connorcoley commented 4 years ago

Remove Compound from ReactionProduct and replace it with repeated CompoundIdentifier and amount

Removing the Compound from ReactionProduct will break backwards compatibility and require us to revise all of the current examples and datasets. It will also make require some substantial changes to the editor in terms of how the Compound fieldset is reused in multiple places. I'm hesitant to make that change.

skearnes commented 4 years ago

Removing the Compound from ReactionProduct will break backwards compatibility and require us to revise all of the current examples and datasets. It will also make require some substantial changes to the editor in terms of how the Compound fieldset is reused in multiple places. I'm hesitant to make that change.

Understood. It would be a pain in the short-term. But now is the time to make this change if we think it would dramatically improve usability of the schema. If the long-term benefit seems modest, we can skip it.

skearnes commented 4 years ago

As I think more about this specific question, I'm leaning toward not trying to rework the usage of Compound. I'm still in favor of adding a Data map to ReactionProduct so that product-level info is grouped at the same level.

Still, let's keep in mind that alpha testing is likely our last chance to make larger-scale, disruptive changes to the schema if we find compelling reasons to do so based on user feedback.

connorcoley commented 4 years ago

As I think more about this specific question, I'm leaning toward not trying to rework the usage of Compound. I'm still in favor of adding a Data map to ReactionProduct so that product-level info is grouped at the same level.

Still, let's keep in mind that alpha testing is likely our last chance to make larger-scale, disruptive changes to the schema if we find compelling reasons to do so based on user feedback.

I'll add that field to this PR.

And yes, point taken. @michaelmaser any opinion about the idea of replacing Compound compound in the ReactionProduct message with separate constituent fields identifiers and amount?

michaelmaser commented 4 years ago

And yes, point taken. @michaelmaser any opinion about the idea of replacing Compound compound in the ReactionProduct message with separate constituent fields identifiers and amount?

Yes, I like this. I agree would avoid re-working all of Compound usage, which works well for the rest of reaction components.

I have been thinking about and am still wondering if/what sort of specific data fields should be included in ReactionProduct based on how they will be accessed downstream. Right now there is Percentage compound_yield, Percentage purity, and Selectivity selectivity. I like the addition of identifiers and amount.

If arbitrary Data maps are added, they would be directly linked to the product species, but the data would still have to be called by assigned name, right? I'm wondering if there is value in defining any structured/specific data fields within the product message to hopefully enforce consistency between submissions, like a peak_area_count message with enumerated methods, for example. Thinking mostly of HTE examples that commonly seem to report peak areas only...so if these are only recorded in arbitrary analyses, then there's no quantitative data linked to the product species. Thoughts? Could be a separate issue.

Aside; looking at the ReactionProduct message, I am realizing Selectivity selectivity might possibly need to be a repeated field...since a product could, e.g., be formed both enantio- and diastereoslectively. Thoughts?

skearnes commented 4 years ago

@michaelmaser Just to clarify, I think that if we avoid reworking Compound we will keep identifiers and amount in Compound rather than adding them to ReactionProduct. Does that make sense?

I like the idea of adding some additional specific fields to ReactionProduct to encourage consistency across submissions.