Open aclum opened 11 months ago
This was discussed today at the metap meeting and folks agree on adding this as a Class to the schema with the slots as required. I talked with @cmungall about this last week in person and IGB and he was supportive as well.
@aclum is this a "WorkflowExecution" subclass based on the new schema structure we're doing?
And, as mentioned over in issue 1309, the best_protein slot in this aggregation class should be renamed to something like has_best_protein or is_best_protein and be given range boolian false/true.
@mslarae13 no, this is an aggregation table from the output of the MetaproteomicsAnalysisActivity/ i am setting this up the same way that @turbomam created the FunctionalAnnotationAggMember class to aggregate the sequencing based annotation results.
@naglepuff or @marySalvi Could you comment on what fields from the metap_gene_function_aggregation collection the data portal ingest code uses? We have to change the name of best_protein since it is already being used in a different way within the schema.
@naglepuff @marySalvi can you please respond to this issue?
Will move to the next sprint. @aclum let me know if it should be in the backlog.
We use:
gene_function_id
count
best_protein
Okay, we need to change the name of that so just a note that this will be a minor breaking change to the data portal. best_protein will be come is_best_protein @naglepuff
@mbthornton-lbl this is ready for the migration code to be written.
Migration code has been written. Ready for code review and to get merged in but I think Mark as signed off for the year so will move to 'in review' for the first sprint in January.
@aclum @lamccue @pdpiehowski @SamuelPurvine @scanon @GrantFujimoto @picowatt
see https://github.com/microbiomedata/nmdc_automation/blob/metap_agg/src/generate_metap_agg.py
I don't think a metaproteomics-based functional annotation aggregation record should have a slot called is_best_protein
.
These aggregations are about Biosample
s, via the annotation process that took the Biosample
's data objects as input. They say "there are N annotations that say this Biosample has at least one gene with some function". The boolean value we are talking about indicates whether any of the individual annotations that have been aggregated are evidenced by a best protein accounting of an observed peptide.
Therefore I propose that the boolean slot should be called evidenced_by_best_protein
the description or a comment could be
Is this aggregated functional annotation on a Biosample based on functions that are associated with some protein that is the best (most parsimonious) accounting for a peptide, whose presence was inferred from some technology like mass spectrometry
This is all irrelevant if we decide that non-best-evidenced functions should not be stored or exposed in the data portal
This task is on hold for now until such time as @SamuelPurvine and @pdpiehowski and others figure out what changes the metaproteomics workflow metadata will look like. Removing from sprint and adding backlog label. Please see notes from January 9, 2024 NMDC Workflows workflows meeting.
One implication of this class definition not being in the schema is that code that performs validation based upon the schema can't check data representing instances of this class (without "hard-coding" special validation code for this class).
For example, code that uses the schema to determine which fields of which Mongo documents might contain the id
s of other documents, and which Mongo collections those other documents reside in, wouldn't know where to look for referenced documents that represent instances of the (hypothetical) MetaPGeneFunctionAggMember
class.
As another example, code that validates whether a given Mongo document conforms to the schema would not be able to validate documents in the (hypothetical) metap_gene_function_agg_member_set
collection.
The referential integrity validation-related implication came up recently as some team members have been developing referential integrity checkers.
I think this is going to be implemented after the Berkeley schema gets merged into the upstream schema.
@eecavanna that is correct. There's a fair amount of churn about this task, including
All of this would more or less obviate the re-id-ing effort work for metaproteomics (not currently performed as of this writing), and also break the aggregation table implementation, which 'just' got fixed to work with current data. Getting the metaproteomics re-id-ing done will run up against the Berkeley schema light freeze in a way that virtually guarantees aggregation of technical debt if we try and "make it happen".
@aclum and @SamuelPurvine. I've given this some thought, and below is my current proposal. I'd appreciate your input before I try to implement in the schema.
The proteomcis team is moving away from using any annotation of "best protein" and instead will be supplying a list of proteins annotated (at a high level of scrutiny) for each MetaproteomicsAnalysis
(see https://github.com/microbiomedata/nmdc-schema/issues/2157). For the resulting aggregation, I believe we will simply have a functional gene ID and a MetaproteomicsAnalysis
ID; which can fit in the current structure of the FunctionalAnnotationAggMember
class with minimal changes.
See proposal below:
Makes sense to me. Curious about whether adding values direxctly to mongoDB for gene_function_id
would/will be possible when we start using non-NMDC-generated-metagenome dependent functional annotations? It certainly seems the structure you have here would make relaxing this dependency easier.
Just as an aside, current thinking for replacing best_protein
is to use razor_protein
with parsimony such that: 1) peptides that belong to one protein have that protein as its razor_protein
; 2) peptides belonging to multiple proteins (i.e. non-unique), only one of which has a separate unique peptide, will associate with that protein as the razor_protein
; 3) non-unique peptides belonging to proteins with no other unique peptide will be associated with one or more proteins having a maximal count of other peptides associated, with these being the razor_protein
entries. We would then upload the entire list of the razor_protein
entries for a given WorkflowExecution
to allow access via API and aggregation as per @kheal's above suggestion. In the case where there isn't matching between the razor_protein
and a GeneProduct
from a metagenomics analysis, the WorkflowExecution
would supply the list of 'gene_function_id' entries directly to the aggregation table, which is why I'm asking the above question.
I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...
@SamuelPurvine
I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...
I don't think I quite understand this part. What records are you worried about duplicating? The FunctionalAnnotationAggMember
tables when we upload new MetaproteomicsAnalysis
records? Those are unique combinations of WorkflowExecution
:GeneProduct
, so I don't think we need to worry about duplication from that sense.
Or are you worried about current mongo records being duplicated when we implement this? If so, @aclum and I have discussed removing all existing proteomics aggregation table records from mongo and regenerating with the chron job after all the schema and scripting for the aggregation tables have been updated.
I mean specifically that there is a separate storage mechanism for gene_function_id values, is there not? If the metagenome-free workflow comes up with a gene_function_id that is already in that data chunk, we wouldn't want it to be added, only things not yet entered via the metagenome route. Maybe I'm just thinking further ahead than I should be :)
I see. I think we will have to deal with that (either or possibly both in the metadata generation for the MetaproteomicsAnalysis
or in the aggregation scripts). But yes, I think that's a 'later us' problem that will become more clear once we have example data that we need to contort into the correct shape. It's encouraging to hear that you think the proposed FunctionalAnnotationAggMember
class could work with those data though.
Yes, this proposal generalizes the FunctionalAnnotationAggMember' class to allow for, checks notes, any
WorkflowExectution` to make entries into the functional annotation space.
Is anyone thinking to add a description anywhere for the count
slot. Would love to know explicitly what we are counting for this aggregation class...
This is sleek, i like it. WRT to duplicates I think we'd want to store both in the aggregation table, but then handle display with nmdc-server logic. It's potentially useful to know the venn diagram of shared gene_function_id between reference based and reference free metap workflows.
@aclum excellent point! might we think of adding a/the source for a given functional annotation to more easily enable this, or just use the workflow "version", however we are going to denote suchlike? I suppose you could build a query that walks up the chain to biosample, walks back down to a/the related metagenome, pulls an example GeneProduct
name and compares it to the GeneProduc
t IDs from the WorkflowExecution
in question... if there's more than one metagenome run then it's moar fun...
I'd assumed the reference free workflow would have a different workflow exeuction subclass name which would allow for easy aggregations. Since the current results are only presence/absence no work has been done on deduplicating such that if reference based annotation runs twice it would only show the results for the latest version.
We added a nmdc schema Class for the gene aggregation for the sequencing workflows, see https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/, but not for the metaproteomics. I believe we wanted all the collections in mongo to have a slot on class Database, in order to do that we need a Class for the metaproteomics gene function aggregation. The collection in prod mongo that current contains this data is 'metap_gene_function_aggregation'
related to https://github.com/microbiomedata/nmdc-schema/issues/1252
Based on that we need class with slots (based on how the collection is currently populated)
metaproteomic_analysis_id gene_function_id (slot already exists -update domain_of for this slot) count (slot already exists - update domain_of for this slot) is_best_protein (need new slot)
all slots should be required
Then Class Database needs a slot 'metap_gene_function_aggregation' with a range of the new metap gene function aggregation Class
@SamuelPurvine @mslarae13 @pdpiehowski @picowatt please confirm requirements, in particular if the slots should be required or not.
cc @turbomam