Closed connorcoley closed 3 years ago
Resolves #432 (where most discussion has been), resolves #386, resolves #420, resolves #426
For discussion here:
- Is it redundant to have both "counts" and "intensity"? I feel like the two aren't completely interchangeable, but there is certainly ambiguity.
I don't have strong feelings about this, but I think if they are not interchangeable then keeping them separate is fine.
- Are we being completely consistent in terms of holding product-specific analytical data in this new ProductMeasurement field? In particular, I'm thinking about IR and HRMS of an isolated compound. The entire analysis is of the presumably-isolated product. Should ProductMeasurement accommodate a structured definition of the full IR spectrum? We shied away from this idea (#48) since it got rather complicated. However, HRMS is a very easy measurement to capture in a structured format -- the dominant peak is just a float m/z. It seems silly not to include this in the ProductMeasurement field. The temperature of a MP experiment or the angle of a CD experiment are similar.
I'm not opposed to adding structured analysis messages piecemeal, especially if we can start with easy ones. We don't have to try to cover all the possibilities at once (and adding new ones to ProductMeasurement later will be backwards compatible).
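As a concrete sketch of what an easy piecemeal addition could look like: an HRMS measurement needs little more than a float. The message name, field names, and numbers below are invented for illustration and are not part of the schema:

```proto
// Hypothetical sketch only; not an actual schema definition.
message MassSpecMeasurement {
  float mz = 1;           // dominant peak m/z
  float expected_mz = 2;  // calculated m/z for the assigned formula
  string details = 3;     // e.g., ionization mode such as "ESI-TOF"
}
```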
- To broaden the previous point, some analyses are applied to crude product mixtures, while others are applied to thought-to-be-pure compounds. In the former case, we definitely need this ProductMeasurement message to create a closer association between an analysis and a single product. But in the latter case, there is already an implicit connection. Some analyses result in a scalar value while others result in more structured data. One option is to add a Data field to the ProductMeasurement and a SPECTRUM type, but then there is ambiguity around when to use that field rather than the ReactionAnalysis's raw_data and processed_data maps. If we do add a Data field to ProductMeasurement, it would certainly be very handy for non-HTE data where people do full product characterizations; we would have access to many examples of structures and their NMR/IR with which to build predictive models.
This is indeed tricky. Maybe we need to think more broadly about how the analysis/product pairs should be structured. My current mental model is that an analysis may contain data that covers one or more compounds, and ProductMeasurement is the place to store compound-specific data extracted from an analysis. So I think that structured analysis data might be best placed in the ReactionAnalysis and summary statistics (like a specific peak area or quantified yield) are best placed in ProductMeasurement.
That doesn't really resolve what to do about single-compound analyses, like melting point, where the data could be in either place?
- I've kept color and texture in the ReactionProduct itself. They are not supposed to be tied to a specific analysis (unless we add "VISUAL_INSPECTION" as an analysis which is silly).
SGTM.
- I don't think that uses_internal_standard and uses_authentic_standard need to be defined for the analysis, as they need to be defined for each (product, analysis) to handle the situation where multiple products are quantified and not all necessarily use an authentic or internal standard.
Agreed; the placement in ProductMeasurement seems appropriate.
Note that this schema change breaks many things (hence the draft PR) and is not backwards compatible.
Huzzah for alpha testing!
> Are we being completely consistent in terms of holding product-specific analytical data in this new ProductMeasurement field? ... HRMS is a very easy measurement to capture in a structured format -- the dominant peak is just a float m/z. It seems silly not to include this in the ProductMeasurement field.
>
> I'm not opposed to adding structured analysis messages piecemeal, especially if we can start with easy ones.

For IR, a similar field to HRMS, but repeated, could be sufficient. If needed, an IR spectrum in a message could be captured by a `repeated float value`.
> To broaden the previous point, some analyses are applied to crude product mixtures, while others are applied to thought-to-be-pure compounds. ... One option is to add a Data field to the ProductMeasurement and a SPECTRUM type, but then there is ambiguity around when to use that field rather than the ReactionAnalysis's raw_data and processed_data maps.

A `repeated float value` of peak chemical shifts could almost capture an NMR spectrum too, except for 1H, where `string` splitting information is also needed. We could try to build a single `Spectrum` message that could handle these, IR, and others with `repeated float value` and `details` fields, or we could build specific messages for each expected spectrum type that isn't just floats.
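A generic message along those lines might look roughly like the following sketch (the message name, field names, and numbering are all hypothetical):

```proto
// Hypothetical sketch of a single generic Spectrum message; not part of the schema.
message Spectrum {
  repeated float value = 1;  // peak positions: cm-1 for IR, ppm for NMR, etc.
  string units = 2;          // units of the peak values
  string details = 3;        // free text, e.g., 1H multiplicities and couplings
}
```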
> My current mental model is that an analysis may contain data that covers one or more compounds, and ProductMeasurement is the place to store compound-specific data extracted from an analysis.

I agree with this model, where `ReactionAnalysis` is used largely for mixtures or in-progress/crude analyses, and `ProductMeasurement` is used for single-component or purified/final compound analyses.

> So I think that structured analysis data might be best placed in the ReactionAnalysis and summary statistics (like a specific peak area or quantified yield) are best placed in ProductMeasurement.

I think this is fine, but if "structured analysis data" is collected on a purified compound or is being used to determine the yield/identity of a product, I would prefer that it be placed in `ProductMeasurement` (I guess like you're saying for "specific peak area or quantified yield").
> That doesn't really resolve what to do about single-compound analyses, like melting point, where the data could be in either place?

As above, I personally think these single-compound analyses should be contained in `ReactionProduct`.
> I've kept color and texture in the ReactionProduct itself. They are not supposed to be tied to a specific analysis (unless we add "VISUAL_INSPECTION" as an analysis, which is silly).

Agreed.

> I don't think that uses_internal_standard and uses_authentic_standard need to be defined for the analysis, as they need to be defined for each (product, analysis) pair to handle the situation where multiple products are quantified and not all necessarily use an authentic or internal standard.
>
> Agreed; the placement in ProductMeasurement seems appropriate.

Right, sounds good.
Discussion notes:
New question for discussion, relevant to the `raw_data` and `processed_data` maps of an analysis: should we add arbitrary unit-ed fields here? For example, in addition to having `float_value`, `string_value`, etc., should we have `Temperature temperature_value`, `Mass mass_value`, etc.?
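For concreteness, the proposal would extend the value `oneof` with unit-ed message types, roughly as sketched below. The surrounding fields and all field numbers here are illustrative, not the actual `Data` definition:

```proto
// Illustrative sketch of the proposal under discussion; field numbers are invented.
message Data {
  oneof kind {
    float float_value = 1;
    string string_value = 2;
    Temperature temperature_value = 3;  // proposed unit-ed field
    Mass mass_value = 4;                // proposed unit-ed field
  }
  string description = 5;
}
```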
> New question for discussion, relevant to the `raw_data` and `processed_data` maps of an analysis: should we add arbitrary unit-ed fields here? ...

Just to clarify, are you thinking about expanding the `Data` message or adding these to `ReactionAnalysis`?
> Just to clarify, are you thinking about expanding the `Data` message or adding these to `ReactionAnalysis`?

This would be an expansion of the `Data` message.
> New question for discussion, relevant to the `raw_data` and `processed_data` maps of an analysis: should we add arbitrary unit-ed fields here? ...

I like the idea of including these structured/unit-ed `Data` types here. It could open an opportunity to add spectra/listed data forms later on.
> I like the idea of including these structured/unit-ed Data types here. ...

I think I'm opposed to extending the `Data` message. It's designed to be the place to put things that have no other place: an unstructured "catch-all" message. I recognize that there could be structured data that has no home, but I think that should be fixed with changes to the schema and not feature creep of `Data`. I'd be more interested in adding these fields to `ReactionAnalysis`.
> I'd be more interested in adding these fields to `ReactionAnalysis`.

SGTM
LGTM + readability
Thinking about migration: I think it will make migration easier if we don't delete anything from the schema. This way all of the old protos will continue to load properly and we can move things over without needing two copies of the repo (IIUC the current *.pbtxt messages won't parse after their fields are removed). It will be really ugly, but will make the migration much easier and then we'll clean everything up.
I'm thinking of something like this:
```proto
...
// BEGIN DEPRECATED
message Feature {
  string name = 1;
  oneof kind {
    string string_value = 2;
    float float_value = 3;
  }
  // Details of computation. Software, theory level, solvation, etc.
  string how_computed = 4;
}
repeated Feature features = 13;
// END DEPRECATED
map<string, Data> features = 15;
// Compounds may be assayed for quality control; analytical data should be
// defined in the analyses map.
map<string, Analysis> analyses = 16;
...
```
Is that too much trouble?
> Thinking about migration: I think it will make migration easier if we don't delete anything from the schema. ... Is that too much trouble?

No, I think that's reasonable. It will be very ugly, to be sure, but it will make migration significantly easier. There will be some edge cases where it's not possible because of field name conflicts (e.g., we're changing the type of `Compound.features` but keeping the name), but the majority of cases will be fine.
I might rescind that answer. We can't do it for `Compound.features` or `ReactionOutcome.analyses` unless we want to choose different names for the fields, which I wouldn't advise. I've added the old schema as `reaction_old.proto`. This might seem sloppy, but we can at least make the transition, since both versions will be importable in the same environment.
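The blocker is that protocol buffers do not allow two fields in the same message to share a name, so the deprecated and replacement fields cannot coexist when only the type changes while the name stays the same (illustrative sketch; the field numbers match the migration snippet above):

```proto
// This does NOT compile: a message cannot declare the field name `features`
// twice, which is why Compound.features cannot keep both old and new versions.
message Compound {
  repeated Feature features = 13;   // old type
  map<string, Data> features = 15;  // new type; name collision with the above
}
```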
SGTM. Thanks.
Merging #496 (beec6f6) into main (3a1df29) will decrease coverage by 1.25%. The diff coverage is 51.85%.
```
@@            Coverage Diff             @@
##             main     #496      +/-   ##
==========================================
- Coverage   76.77%   75.51%   -1.26%
==========================================
  Files          24       24
  Lines        2118     2169      +51
  Branches      472      492      +20
==========================================
+ Hits         1626     1638      +12
- Misses        347      380      +33
- Partials      145      151       +6
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ord_schema/visualization/filters.py | 53.17% <22.22%> (-1.43%) | :arrow_down: |
| ord_schema/validations.py | 69.46% <49.12%> (-0.63%) | :arrow_down: |
| ord_schema/message_helpers.py | 84.64% <68.00%> (-0.66%) | :arrow_down: |
| ord_schema/interface/build_database.py | 81.81% <75.00%> (+0.13%) | :arrow_up: |
| ord_schema/resolvers.py | 74.71% <100.00%> (ø) | |
| ord_schema/templating.py | 94.82% <100.00%> (ø) | |
| ord_schema/interface/ord_client.py | 39.47% <0.00%> (-42.11%) | :arrow_down: |
Here is a snippet from the updated Liu et al. OrgSyn example:
```python
## Product and characterization
outcome = reaction.outcomes.add()
outcome.reaction_time.CopyFrom(unit_resolver.resolve('6 h'))
analysis = reaction_pb2.Analysis()  # for handy enum reference
outcome.analyses['isolated_weight'].type = analysis.WEIGHT
outcome.analyses['isolated_weight'].is_of_isolated_species = True
outcome.analyses['isolated_weight'].details = '27.5-28.0 g recovered after workup'
outcome.analyses['1H NMR'].type = analysis.NMR_1H
outcome.analyses['1H NMR'].is_of_isolated_species = True
outcome.analyses['1H NMR'].details = '400 MHz, CDCl3'
outcome.analyses['1H NMR'].data['peaks'].string_value = r'0.92 (s, 9H), 4.06 (s, 4H), 7.23 - 7.34 (m, 6H), 7.40 (d, J = 7.1 Hz, 4H)'
outcome.analyses['1H NMR'].data['peaks'].description = 'List of peaks'
outcome.analyses['13C NMR'].type = analysis.NMR_13C
outcome.analyses['13C NMR'].is_of_isolated_species = True
outcome.analyses['13C NMR'].details = '101 MHz, CDCl3'
outcome.analyses['13C NMR'].data['peaks'].string_value = r'27.1, 38.4, 62.4, 127.7, 128.3, 129.6, 136.2, 176.3'
outcome.analyses['13C NMR'].data['peaks'].description = 'List of peaks'
outcome.analyses['thin film IR'].type = analysis.IR
outcome.analyses['thin film IR'].is_of_isolated_species = True
outcome.analyses['thin film IR'].details = 'neat film, NaCl'
outcome.analyses['thin film IR'].data['peaks'].string_value = r'3064, 3031, 2973, 2932, 2906, 2872, 1751, 1496, 1479, 1456, 1273, 1116, 1029, 738, 698'
outcome.analyses['thin film IR'].data['peaks'].description = 'List of peaks [cm-1]'
outcome.analyses['HRMS'].type = analysis.HRMS
outcome.analyses['HRMS'].is_of_isolated_species = True
outcome.analyses['HRMS'].details = 'ESI-TOF'
outcome.analyses['HRMS'].data['expected'].float_value = 298.1802
outcome.analyses['HRMS'].data['expected'].description = 'Expected m/z'
outcome.analyses['HRMS'].data['found'].float_value = 298.1794
outcome.analyses['HRMS'].data['found'].description = 'Observed m/z'
outcome.analyses['quantitative NMR'].type = analysis.NMR_1H
outcome.analyses['quantitative NMR'].is_of_isolated_species = True
outcome.analyses['quantitative NMR'].details = 'Quantitative NMR using 1,1,2,2-tetrachloroethane (>98%, purchased from Alfa Aesar) in CDCl3'
# A single product was desired and characterized
product = outcome.products.add()
product.identifiers.add(type='SMILES', value='O=C(C(C)(C)C)ON(CC1=CC=CC=C1)CC2=CC=CC=C2')
product.identifiers.add(type='NAME', value='N,N-Dibenzyl-O-pivaloylhydroxylamine')
product.mass.CopyFrom(unit_resolver.resolve('27.75 g'))
# Define which analyses were used for which aspects of characterization
product.measurements.add(type='IDENTITY', analysis_key='1H NMR')
product.measurements.add(type='IDENTITY', analysis_key='13C NMR')
product.measurements.add(type='IDENTITY', analysis_key='HRMS')
product.measurements.add(type='YIELD', analysis_key='isolated_weight', percentage=dict(value=93.5, precision=0.5))
product.measurements.add(type='PURITY', analysis_key='quantitative NMR', percentage=dict(value=99))
```
Updates look wonderful. Thanks, Connor!
> \# Define which analyses were used for which aspects of characterization
> product.measurements.add(type='IDENTITY', analysis_key='1H NMR')
> product.measurements.add(type='IDENTITY', analysis_key='13C NMR')
> product.measurements.add(type='IDENTITY', analysis_key='HRMS')

Would just consider adding the IR analysis as an IDENTITY measurement as well:

```python
product.measurements.add(type='IDENTITY', analysis_key='thin film IR')
```
Can look at notebooks soon.
@michaelmaser any additional comments? It would be great to merge this today.
Will take a look today. Sorry for delay, been a busy week.
Notebooks look great. Left some comments, but think anything could be addressed later if desired. Otherwise, LGTM.
@skearnes good to merge?
Please resolve Mike's outstanding comments, and then we're good to go! Thanks for pushing this forward. After the merge, I'll create a new release for ord-schema and start working on the migration for the other repos. I'll work on new branches from `main` so we can all contribute, like we did here.
PS: it's fine with me if you want to merge now and address e.g. separating standards in subsequent PRs. This PR is getting huge and we can start working on the migration in the meantime...
Resolves #432 (where most discussion has been), resolves #386, resolves #420, resolves #426, resolves #466, resolves #506
Note that this schema change is not backwards compatible. The old schema is included in this repo to make data transfer easier.