open-reaction-database / ord-schema

Schema for the Open Reaction Database
https://open-reaction-database.org
Apache License 2.0

Product measurement/characterization revision to the schema #496

Closed connorcoley closed 3 years ago

connorcoley commented 4 years ago

Resolves #432 (where most discussion has been), resolves #386, resolves #420, resolves #426, resolves #466, resolves #506

For discussion here:

  1. Is it redundant to have both "counts" and "intensity"? I feel like the two aren't completely interchangeable, but there is certainly ambiguity.
  2. Are we being completely consistent in terms of holding product-specific analytical data in this new ProductMeasurement field? In particular, I'm thinking about IR and HRMS of an isolated compound. The entire analysis is of the presumably-isolated product. Should ProductMeasurement accommodate a structured definition of the full IR spectrum? We shied away from this idea (#48) since it got rather complicated. However, HRMS is a very easy measurement to capture in a structured format -- the dominant peak is just a float m/z. It seems silly not to include this in the ProductMeasurement field. The temperature of an MP experiment or the angle of a CD experiment are similar.
  3. To broaden the previous point, some analyses are applied to crude product mixtures, while others are applied to thought-to-be-pure compounds. In the former case, we definitely need this ProductMeasurement message to create a closer association between an analysis and a single product. But in the latter case, there is already an implicit connection. Some analyses result in a scalar value while others result in more structured data. One option is to add a Data field to the ProductMeasurement and a SPECTRUM type, but then there is ambiguity around when to use that field rather than the ReactionAnalysis's raw_data and processed_data maps. If we do add a Data field to ProductMeasurement, it would certainly be very handy for non-HTE data where people do full product characterizations; we would have access to many examples of structures and their NMR/IR with which to build predictive models.
  4. I've kept color and texture in the ReactionProduct itself. They are not supposed to be tied to a specific analysis (unless we add "VISUAL_INSPECTION" as an analysis which is silly).
  5. I don't think that uses_internal_standard and uses_authentic_standard need to be defined for the analysis, as they need to be defined for each (product, analysis) to handle the situation where multiple products are quantified and not all necessarily use an authentic or internal standard.

Note that this schema change is not backwards compatible. The old schema is included in this repo to make data transfer easier.
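For concreteness, a rough sketch of the message shape these points describe, assembled from the fields mentioned in this thread (the field numbers, enum values, and oneof layout are illustrative placeholders, not the final schema):

```proto
// Illustrative sketch only; not the merged schema.
message ProductMeasurement {
  enum MeasurementType {
    UNSPECIFIED = 0;
    IDENTITY = 1;
    YIELD = 2;
    PURITY = 3;
    // ...plus e.g. COUNTS/INTENSITY (see discussion point 1)...
  }
  MeasurementType type = 1;
  // Key into the outcome's analyses map, tying this value to one analysis.
  string analysis_key = 2;
  // Defined per (product, analysis) pair; see discussion point 5.
  bool uses_internal_standard = 3;
  bool uses_authentic_standard = 4;
  oneof value {
    Percentage percentage = 5;
    float float_value = 6;
    string string_value = 7;
  }
}
```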

skearnes commented 4 years ago

Resolves #432 (where most discussion has been), resolves #386, resolves #420, resolves #426

For discussion here:

  1. Is it redundant to have both "counts" and "intensity"? I feel like the two aren't completely interchangeable, but there is certainly ambiguity.

I don't have strong feelings about this, but I think if they are not interchangeable then keeping them separate is fine.

  2. Are we being completely consistent in terms of holding product-specific analytical data in this new ProductMeasurement field? In particular, I'm thinking about IR and HRMS of an isolated compound. The entire analysis is of the presumably-isolated product. Should ProductMeasurement accommodate a structured definition of the full IR spectrum? We shied away from this idea (#48) since it got rather complicated. However, HRMS is a very easy measurement to capture in a structured format -- the dominant peak is just a float m/z. It seems silly not to include this in the ProductMeasurement field. The temperature of an MP experiment or the angle of a CD experiment are similar.

I'm not opposed to adding structured analysis messages piecemeal, especially if we can start with easy ones. We don't have to try to cover all the possibilities at once (and adding new ones to ProductMeasurement later will be backwards compatible).

  3. To broaden the previous point, some analyses are applied to crude product mixtures, while others are applied to thought-to-be-pure compounds. In the former case, we definitely need this ProductMeasurement message to create a closer association between an analysis and a single product. But in the latter case, there is already an implicit connection. Some analyses result in a scalar value while others result in more structured data. One option is to add a Data field to the ProductMeasurement and a SPECTRUM type, but then there is ambiguity around when to use that field rather than the ReactionAnalysis's raw_data and processed_data maps. If we do add a Data field to ProductMeasurement, it would certainly be very handy for non-HTE data where people do full product characterizations; we would have access to many examples of structures and their NMR/IR with which to build predictive models.

This is indeed tricky. Maybe we need to think more broadly about how the analysis/product pairs should be structured. My current mental model is that an analysis may contain data that covers one or more compounds, and ProductMeasurement is the place to store compound-specific data extracted from an analysis. So I think that structured analysis data might be best placed in the ReactionAnalysis and summary statistics (like a specific peak area or quantified yield) are best placed in ProductMeasurement.

That doesn't really resolve what to do about single-compound analyses, like melting point, where the data could be in either place?

  4. I've kept color and texture in the ReactionProduct itself. They are not supposed to be tied to a specific analysis (unless we add "VISUAL_INSPECTION" as an analysis which is silly).

SGTM.

  5. I don't think that uses_internal_standard and uses_authentic_standard need to be defined for the analysis, as they need to be defined for each (product, analysis) to handle the situation where multiple products are quantified and not all necessarily use an authentic or internal standard.

Agreed; the placement in ProductMeasurement seems appropriate.

Note that this schema change breaks many things (hence the draft PR) and is not backwards compatible.

Huzzah for alpha testing!

michaelmaser commented 4 years ago
  2. Are we being completely consistent in terms of holding product-specific analytical data in this new ProductMeasurement field? In particular, I'm thinking about IR and HRMS of an isolated compound. The entire analysis is of the presumably-isolated product. Should ProductMeasurement accommodate a structured definition of the full IR spectrum? We shied away from this idea (#48) since it got rather complicated. However, HRMS is a very easy measurement to capture in a structured format -- the dominant peak is just a float m/z. It seems silly not to include this in the ProductMeasurement field. The temperature of an MP experiment or the angle of a CD experiment are similar.

    I'm not opposed to adding structured analysis messages piecemeal, especially if we can start with easy ones. We don't have to try to cover all the possibilities at once (and adding new ones to ProductMeasurement later will be backwards compatible).

For IR, a field similar to the HRMS one, but repeated, could be sufficient. If needed, an IR spectrum in a message could be captured by a repeated float value.

  3. To broaden the previous point, some analyses are applied to crude product mixtures, while others are applied to thought-to-be-pure compounds. In the former case, we definitely need this ProductMeasurement message to create a closer association between an analysis and a single product. But in the latter case, there is already an implicit connection. Some analyses result in a scalar value while others result in more structured data. One option is to add a Data field to the ProductMeasurement and a SPECTRUM type, but then there is ambiguity around when to use that field rather than the ReactionAnalysis's raw_data and processed_data maps. If we do add a Data field to ProductMeasurement, it would certainly be very handy for non-HTE data where people do full product characterizations; we would have access to many examples of structures and their NMR/IR with which to build predictive models.

A repeated float value of peak chemical shifts could almost capture an NMR spectrum too, except for 1H, where splitting (multiplicity) information is also needed as a string. We could try to build a single Spectrum message that handles these, IR, and others with repeated float value and details fields, or we could build specific messages for each expected spectrum type that isn't just floats.
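A stdlib-only sketch of that idea: a comma-separated 13C or IR peak list maps cleanly onto floats, while a 1H entry like "0.92 (s, 9H)" does not, which is what pushes 1H toward a structured message:

```python
# Sketch: why "repeated float value" covers 13C/IR peak lists but not 1H NMR.

def parse_peak_floats(peaks: str) -> list[float]:
    """Parse a comma-separated peak list (13C shifts, IR wavenumbers)."""
    return [float(p) for p in peaks.split(",")]

# A 13C-style peak list converts cleanly:
shifts_13c = parse_peak_floats("27.1, 38.4, 62.4, 127.7, 128.3, 129.6, 136.2, 176.3")

# A 1H entry such as "0.92 (s, 9H)" would raise ValueError here; the shift,
# multiplicity, and integration need separate fields (or a details string).
```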

My current mental model is that an analysis may contain data that covers one or more compounds, and ProductMeasurement is the place to store compound-specific data extracted from an analysis.

I agree with this model, where ReactionAnalysis is used largely for mixtures or in-progress/crude analysis, and ProductMeasurement is used for single-component or purified/final compound analysis.

So I think that structured analysis data might be best placed in the ReactionAnalysis and summary statistics (like a specific peak area or quantified yield) are best placed in ProductMeasurement.

I think this is fine, but if "structured analysis data" is carried out on a purified compound or is being used to determine yield/identity of a product, I think I would prefer it be placed in ProductMeasurement (I guess like you're saying for "specific peak area or quantified yield").

That doesn't really resolve what to do about single-compound analyses, like melting point, where the data could be in either place?

As above, I personally think these single-compound analyses should be contained in ReactionProduct.

  4. I've kept color and texture in the ReactionProduct itself. They are not supposed to be tied to a specific analysis (unless we add "VISUAL_INSPECTION" as an analysis which is silly).

Agreed.

  5. I don't think that uses_internal_standard and uses_authentic_standard need to be defined for the analysis, as they need to be defined for each (product, analysis) to handle the situation where multiple products are quantified and not all necessarily use an authentic or internal standard.

    Agreed; the placement in ProductMeasurement seems appropriate.

Right, sounds good.

connorcoley commented 4 years ago

Discussion notes:

connorcoley commented 4 years ago

New question for discussion, relevant to the raw_data and processed_data maps of an analysis: should we add arbitrary unit-ed fields here? For example, in addition to having float_value, string_value, etc., should we have Temperature temperature_value, Mass mass_value, etc.?

skearnes commented 4 years ago

New question for discussion, relevant to the raw_data and processed_data maps of an analysis: should we add arbitrary unit-ed fields here? For example, in addition to having float_value, string_value, etc., should we have Temperature temperature_value, Mass mass_value, etc.?

Just to clarify, are you thinking about expanding the Data message or adding these to ReactionAnalysis?

connorcoley commented 4 years ago

New question for discussion, relevant to the raw_data and processed_data maps of an analysis: should we add arbitrary unit-ed fields here? For example, in addition to having float_value, string_value, etc., should we have Temperature temperature_value, Mass mass_value, etc.?

Just to clarify, are you thinking about expanding the Data message or adding these to ReactionAnalysis?

This would be an expansion of the Data message

michaelmaser commented 4 years ago

New question for discussion, relevant to the raw_data and processed_data maps of an analysis: should we add arbitrary unit-ed fields here? For example, in addition to having float_value, string_value, etc., should we have Temperature temperature_value, Mass mass_value, etc.?

I like the idea of including these structured / unit-ed Data types here. Could open opportunity to add spectra / listed data forms later on

skearnes commented 4 years ago

New question for discussion, relevant to the raw_data and processed_data maps of an analysis: should we add arbitrary unit-ed fields here? For example, in addition to having float_value, string_value, etc., should we have Temperature temperature_value, Mass mass_value, etc.?

I like the idea of including these structured / unit-ed Data types here. Could open opportunity to add spectra / listed data forms later on

I think I'm opposed to extending the Data message. It's designed to be the place to put things that have no other place---an unstructured "catch all" message. I recognize that there could be structured data that has no home, but I think that should be fixed with changes to the schema and not feature creep of Data.

I'd be more interested in adding these fields to ReactionAnalysis.
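As a sketch of what that alternative could look like (purely illustrative; the field names and numbers here are invented, assuming Temperature and Mass are the existing unit messages):

```proto
// Illustrative sketch only; one possible shape for unit-ed analysis values.
message ReactionAnalysis {
  // ...existing fields: type, details, raw_data, processed_data, etc...

  // Structured, unit-ed values extracted from the analysis, keyed by a
  // free-text label (e.g. "melting_point", "isolated_mass").
  map<string, Temperature> temperature_values = 20;
  map<string, Mass> mass_values = 21;
}
```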

michaelmaser commented 4 years ago

I'd be more interested in adding these fields to ReactionAnalysis.

SGTM

michaelmaser commented 4 years ago

LGTM + readability

skearnes commented 4 years ago

Thinking about migration: I think it will make migration easier if we don't delete anything from the schema. This way all of the old protos will continue to load properly and we can move things over without needing two copies of the repo (IIUC the current *.pbtxt messages won't parse after their fields are removed). It will be really ugly, but will make the migration much easier and then we'll clean everything up.

I'm thinking of something like this:

...
// BEGIN DEPRECATED
message Feature {
  string name = 1;
  oneof kind {
    string string_value = 2;
    float float_value = 3;
  }
  // Details of computation. Software, theory level, solvation, etc.
  string how_computed = 4;
}
repeated Feature features = 13;
// END DEPRECATED
map<string, Data> features = 15;
// Compounds may be assayed for quality control; analytical data should be
// defined in the analyses map.
map<string, Analysis> analyses = 16;
...

Is that too much trouble?

connorcoley commented 4 years ago

Thinking about migration: I think it will make migration easier if we don't delete anything from the schema. ... Is that too much trouble?

No, I think that's reasonable. It will be very ugly, to be sure, but will make migration significantly easier. There will be some edge cases where it's not possible to do that because of field name conflicts (e.g., we're changing the type of Compound.features but keeping the name), but the majority of cases will be fine

connorcoley commented 4 years ago

Thinking about migration: I think it will make migration easier if we don't delete anything from the schema. ... Is that too much trouble?

No, I think that's reasonable. It will be very ugly, to be sure, but will make migration significantly easier. There will be some edge cases where it's not possible to do that because of field name conflicts (e.g., we're changing the type of Compound.features but keeping the name), but the majority of cases will be fine

I might rescind that answer. We can't do it for Compound.features or ReactionOutcome.analyses unless we want to choose different names for the fields, which I wouldn't advise.

I've added in the old schema as reaction_old.proto. This might seem sloppy, but we can at least make the transition since both versions will be importable in the same environment
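As a loose illustration of the kind of transform the migration code will need for Compound.features, with plain dicts standing in for protos: the old schema carried a repeated Feature (name plus value), while the new schema keys a Data message by that name. Folding how_computed into the mapped value is one plausible choice made for this sketch, not a settled decision.

```python
# Stdlib-only sketch of migrating the old repeated Feature field to the new
# map<string, Data> shape. Dicts stand in for protos; this is not ord-schema API.

def migrate_features(old_features: list[dict]) -> dict[str, dict]:
    """Re-key old Feature entries (name + value kind) by their name."""
    new_features = {}
    for feature in old_features:
        # Everything except the name (string_value / float_value /
        # how_computed) becomes the mapped value.
        new_features[feature["name"]] = {
            k: v for k, v in feature.items() if k != "name"
        }
    return new_features

old = [{"name": "logP", "float_value": 2.1, "how_computed": "RDKit Crippen"}]
print(migrate_features(old))
```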

skearnes commented 4 years ago

Thinking about migration: I think it will make migration easier if we don't delete anything from the schema. ... Is that too much trouble?

No, I think that's reasonable. It will be very ugly, to be sure, but will make migration significantly easier. There will be some edge cases where it's not possible to do that because of field name conflicts (e.g., we're changing the type of Compound.features but keeping the name), but the majority of cases will be fine

I might rescind that answer. We can't do it for Compound.features or ReactionOutcome.analyses unless we want to choose different names for the fields, which I wouldn't advise.

I've added in the old schema as reaction_old.proto. This might seem sloppy, but we can at least make the transition since both versions will be importable in the same environment

SGTM. Thanks.

codecov[bot] commented 3 years ago

Codecov Report

Merging #496 (beec6f6) into main (3a1df29) will decrease coverage by 1.25%. The diff coverage is 51.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #496      +/-   ##
==========================================
- Coverage   76.77%   75.51%   -1.26%     
==========================================
  Files          24       24              
  Lines        2118     2169      +51     
  Branches      472      492      +20     
==========================================
+ Hits         1626     1638      +12     
- Misses        347      380      +33     
- Partials      145      151       +6     
Impacted Files | Coverage Δ
--- | ---
ord_schema/visualization/filters.py | 53.17% <22.22%> (-1.43%) ↓
ord_schema/validations.py | 69.46% <49.12%> (-0.63%) ↓
ord_schema/message_helpers.py | 84.64% <68.00%> (-0.66%) ↓
ord_schema/interface/build_database.py | 81.81% <75.00%> (+0.13%) ↑
ord_schema/resolvers.py | 74.71% <100.00%> (ø)
ord_schema/templating.py | 94.82% <100.00%> (ø)
ord_schema/interface/ord_client.py | 39.47% <0.00%> (-42.11%) ↓

connorcoley commented 3 years ago

Here is a snippet from the updated Liu et al. OrgSyn example:

## Product and characterization
outcome = reaction.outcomes.add()
outcome.reaction_time.CopyFrom(unit_resolver.resolve('6 h'))
analysis = reaction_pb2.Analysis() # for handy enum reference

outcome.analyses['isolated_weight'].type = analysis.WEIGHT
outcome.analyses['isolated_weight'].is_of_isolated_species = True
outcome.analyses['isolated_weight'].details = '27.5-28.0 g recovered after workup'

outcome.analyses['1H NMR'].type = analysis.NMR_1H
outcome.analyses['1H NMR'].is_of_isolated_species = True
outcome.analyses['1H NMR'].details = '400 MHz, CDCl3'
outcome.analyses['1H NMR'].data['peaks'].string_value = r'0.92 (s, 9H), 4.06 (s, 4H), 7.23 - 7.34 (m, 6H), 7.40 (d, J = 7.1 Hz, 4H)'
outcome.analyses['1H NMR'].data['peaks'].description = 'List of peaks'

outcome.analyses['13C NMR'].type = analysis.NMR_13C
outcome.analyses['13C NMR'].is_of_isolated_species = True
outcome.analyses['13C NMR'].details = '101 MHz, CDCl3'
outcome.analyses['13C NMR'].data['peaks'].string_value = r'27.1, 38.4, 62.4, 127.7, 128.3, 129.6, 136.2, 176.3'
outcome.analyses['13C NMR'].data['peaks'].description = 'List of peaks'

outcome.analyses['thin film IR'].type = analysis.IR
outcome.analyses['thin film IR'].is_of_isolated_species = True
outcome.analyses['thin film IR'].details = 'neat film, NaCl'
outcome.analyses['thin film IR'].data['peaks'].string_value = r'3064, 3031, 2973, 2932, 2906, 2872, 1751, 1496, 1479, 1456, 1273, 1116, 1029, 738, 698'
outcome.analyses['thin film IR'].data['peaks'].description = 'List of peaks [cm-1]'

outcome.analyses['HRMS'].type = analysis.HRMS
outcome.analyses['HRMS'].is_of_isolated_species = True
outcome.analyses['HRMS'].details = 'ESI-TOF'
outcome.analyses['HRMS'].data['expected'].float_value = 298.1802
outcome.analyses['HRMS'].data['expected'].description = 'Expected m/z'
outcome.analyses['HRMS'].data['found'].float_value = 298.1794
outcome.analyses['HRMS'].data['found'].description = 'Observed m/z'

outcome.analyses['quantitative NMR'].type = analysis.NMR_1H
outcome.analyses['quantitative NMR'].is_of_isolated_species = True
outcome.analyses['quantitative NMR'].details = 'Quantitative NMR using 1,1,2,2-tetrachloroethane (>98%, purchased from Alfa Aesar) in CDCl3'

# A single product was desired and characterized
product = outcome.products.add()
product.identifiers.add(type='SMILES', value='O=C(C(C)(C)C)ON(CC1=CC=CC=C1)CC2=CC=CC=C2')
product.identifiers.add(type='NAME', value='N,N-Dibenzyl-O-pivaloylhydroxylamine')
product.mass.CopyFrom(unit_resolver.resolve('27.75 g'))

# Define which analyses were used for which aspects of characterization
product.measurements.add(type='IDENTITY', analysis_key='1H NMR')
product.measurements.add(type='IDENTITY', analysis_key='13C NMR')
product.measurements.add(type='IDENTITY', analysis_key='HRMS')
product.measurements.add(type='YIELD', analysis_key='isolated_weight', percentage=dict(value=93.5, precision=0.5))
product.measurements.add(type='PURITY', analysis_key='quantitative NMR', percentage=dict(value=99))
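One small aside on the HRMS entries in this snippet: agreement between expected and observed m/z is conventionally quoted in parts per million, and the values above work out to roughly -2.7 ppm (stdlib-only check):

```python
# Mass accuracy (ppm) for the HRMS values in the snippet above.

def ppm_error(expected_mz: float, found_mz: float) -> float:
    """Signed mass error in parts per million relative to the expected m/z."""
    return (found_mz - expected_mz) / expected_mz * 1e6

error = ppm_error(298.1802, 298.1794)
print(f"{error:.2f} ppm")  # about -2.68 ppm
```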
michaelmaser commented 3 years ago

Updates look wonderful. Thanks, Connor!

# Define which analyses were used for which aspects of characterization
product.measurements.add(type='IDENTITY', analysis_key='1H NMR')
product.measurements.add(type='IDENTITY', analysis_key='13C NMR')
product.measurements.add(type='IDENTITY', analysis_key='HRMS')

Would just consider adding the IR analysis as an IDENTITY measurement as well: product.measurements.add(type='IDENTITY', analysis_key='thin film IR')

Can look at notebooks soon.

skearnes commented 3 years ago

@michaelmaser any additional comments? It would be great to merge this today.

michaelmaser commented 3 years ago

Will take a look today. Sorry for delay, been a busy week.

michaelmaser commented 3 years ago

Notebooks look great. Left some comments, but think anything could be addressed later if desired. Otherwise, LGTM.

connorcoley commented 3 years ago

@skearnes good to merge?

skearnes commented 3 years ago

@skearnes good to merge?

Please resolve Mike's outstanding comments, and then we're good to go! Thanks for pushing this forward. After the merge, I'll create a new release for ord-schema and start working on the migration for the other repos. I'll work on new branches from main so we can all contribute, like we did here.

skearnes commented 3 years ago

PS it's fine with me if you want to merge now and address e.g. separating standards in subsequent PRs. This PR is getting huge and we can start working on the migration in the meantime...