microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

create class for metap gene function aggregation #1253

Open · aclum opened this issue 11 months ago

aclum commented 11 months ago

We added an NMDC schema class for the gene aggregation for the sequencing workflows (see https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/), but not for metaproteomics. I believe we wanted all the collections in Mongo to have a slot on class Database; in order to do that, we need a class for the metaproteomics gene function aggregation. The collection in prod Mongo that currently contains this data is 'metap_gene_function_aggregation'.

related to https://github.com/microbiomedata/nmdc-schema/issues/1252

Based on that, we need a class with the following slots (based on how the collection is currently populated):

- metaproteomic_analysis_id
- gene_function_id (slot already exists; update domain_of for this slot)
- count (slot already exists; update domain_of for this slot)
- is_best_protein (new slot needed)

all slots should be required

Then class Database needs a slot 'metap_gene_function_aggregation' with a range of the new metap gene function aggregation class.
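A minimal LinkML sketch of what that could look like, based on the requirements above (the class name, slot placement, and Database slot details are illustrative and would need to be confirmed against the rest of the schema):

```yaml
# Hypothetical sketch only -- names, descriptions, and ranges to be confirmed
classes:
  MetapGeneFunctionAggMember:
    description: >-
      Aggregation of gene-function annotations produced by a single
      metaproteomics analysis, analogous to FunctionalAnnotationAggMember.
    slots:
      - metaproteomic_analysis_id
      - gene_function_id   # existing slot; add this class to its domain_of
      - count              # existing slot; add this class to its domain_of
      - is_best_protein    # new slot
    slot_usage:
      metaproteomic_analysis_id:
        required: true
      gene_function_id:
        required: true
      count:
        required: true
      is_best_protein:
        required: true
        range: boolean

  Database:
    slots:
      - metap_gene_function_aggregation
    slot_usage:
      metap_gene_function_aggregation:
        range: MetapGeneFunctionAggMember
        multivalued: true
        inlined_as_list: true
```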

@SamuelPurvine @mslarae13 @pdpiehowski @picowatt please confirm requirements, in particular if the slots should be required or not.

cc @turbomam

aclum commented 11 months ago

This was discussed today at the metap meeting and folks agree on adding this as a class to the schema with the slots as required. I talked with @cmungall about this last week in person at IGB and he was supportive as well.

mslarae13 commented 10 months ago

@aclum is this a "WorkflowExecution" subclass based on the new schema structure we're doing?

SamuelPurvine commented 10 months ago

And, as mentioned over in issue 1309, the best_protein slot in this aggregation class should be renamed to something like has_best_protein or is_best_protein and be given a boolean (true/false) range.

aclum commented 10 months ago

@mslarae13 no, this is an aggregation table built from the output of the MetaproteomicsAnalysisActivity. I am setting this up the same way that @turbomam created the FunctionalAnnotationAggMember class to aggregate the sequencing-based annotation results.

aclum commented 10 months ago

@naglepuff or @marySalvi Could you comment on what fields from the metap_gene_function_aggregation collection the data portal ingest code uses? We have to change the name of best_protein since it is already being used in a different way within the schema.

ssarrafan commented 9 months ago

@naglepuff @marySalvi can you please respond to this issue?

Will move to the next sprint. @aclum let me know if it should be in the backlog.

naglepuff commented 9 months ago

We use:

- gene_function_id
- count
- best_protein

aclum commented 9 months ago

Okay, we need to change the name of that, so just a note that this will be a minor breaking change for the data portal: best_protein will become is_best_protein. @naglepuff

aclum commented 9 months ago

@mbthornton-lbl this is ready for the migration code to be written.

aclum commented 9 months ago

Migration code has been written. It is ready for code review and to be merged, but I think Mark has signed off for the year, so I will move this to 'in review' for the first sprint in January.

turbomam commented 8 months ago

@aclum @lamccue @pdpiehowski @SamuelPurvine @scanon @GrantFujimoto @picowatt

see https://github.com/microbiomedata/nmdc_automation/blob/metap_agg/src/generate_metap_agg.py

I don't think a metaproteomics-based functional annotation aggregation record should have a slot called is_best_protein.

These aggregations are about Biosamples, via the annotation process that took the Biosample's data objects as input. They say "there are N annotations that say this Biosample has at least one gene with some function". The boolean value we are talking about indicates whether any of the individual annotations that have been aggregated are evidenced by a best protein accounting of an observed peptide.

Therefore I propose that the boolean slot should be called evidenced_by_best_protein

The description or a comment could be:

Is this aggregated functional annotation on a Biosample based on functions that are associated with some protein that is the best (most parsimonious) accounting for a peptide, whose presence was inferred from some technology like mass spectrometry?
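In LinkML terms, a sketch of the proposed slot could look like the following (the description uses the wording above; slot placement and any additional constraints are left open):

```yaml
# Sketch only -- slot name and description as proposed in this comment
slots:
  evidenced_by_best_protein:
    range: boolean
    description: >-
      Is this aggregated functional annotation on a Biosample based on
      functions that are associated with some protein that is the best
      (most parsimonious) accounting for a peptide, whose presence was
      inferred from some technology like mass spectrometry?
```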

turbomam commented 8 months ago

This is all irrelevant if we decide that non-best-evidenced functions should not be stored or exposed in the data portal

aclum commented 8 months ago

This task is on hold for now until such time as @SamuelPurvine and @pdpiehowski and others figure out what the changes to the metaproteomics workflow metadata will look like. Removing from sprint and adding the backlog label. Please see notes from the January 9, 2024 NMDC Workflows meeting.

eecavanna commented 4 months ago

One implication of this class definition not being in the schema is that code that performs validation based upon the schema can't check data representing instances of this class (without "hard-coding" special validation code for this class).

Example: Referential integrity validation

For example, code that uses the schema to determine which fields of which Mongo documents might contain the ids of other documents, and which Mongo collections those other documents reside in, wouldn't know where to look for referenced documents that represent instances of the (hypothetical) MetaPGeneFunctionAggMember class.

Example: Document validation

As another example, code that validates whether a given Mongo document conforms to the schema would not be able to validate documents in the (hypothetical) metap_gene_function_agg_member_set collection.

The referential integrity validation-related implication came up recently as some team members have been developing referential integrity checkers.

eecavanna commented 4 months ago

I think this is going to be implemented after the Berkeley schema gets merged into the upstream schema.

SamuelPurvine commented 4 months ago

@eecavanna that is correct. There's been a fair amount of churn around this task.

All of this would more or less obviate the re-id-ing effort for metaproteomics (not yet performed as of this writing), and also break the aggregation table implementation, which 'just' got fixed to work with current data. Getting the metaproteomics re-id-ing done will run up against the Berkeley schema light freeze in a way that virtually guarantees accumulating technical debt if we try to "make it happen".

kheal commented 1 month ago

@aclum and @SamuelPurvine, I've given this some thought, and below is my current proposal. I'd appreciate your input before I try to implement it in the schema.

The proteomics team is moving away from using any annotation of "best protein" and will instead supply a list of proteins annotated (at a high level of scrutiny) for each MetaproteomicsAnalysis (see https://github.com/microbiomedata/nmdc-schema/issues/2157). For the resulting aggregation, I believe we will simply have a functional gene ID and a MetaproteomicsAnalysis ID, which can fit into the current structure of the FunctionalAnnotationAggMember class with minimal changes.

See proposal below:

[Screenshot of the proposed schema changes, 2024-08-21]
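To make that concrete, a hypothetical aggregation record under the generalized class might look like the following (this assumes the existing FunctionalAnnotationAggMember pattern with the WorkflowExecution referenced by a slot such as was_generated_by; the collection name, slot names, and id values here are illustrative, not confirmed):

```yaml
# Illustrative instance only -- names and ids are made up
functional_annotation_agg:
  - was_generated_by: nmdc:wfmp-11-abc123.1   # hypothetical MetaproteomicsAnalysis id
    gene_function_id: KEGG.ORTHOLOGY:K00001
    count: 3
```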

SamuelPurvine commented 1 month ago

Makes sense to me. I'm curious about whether adding values directly to MongoDB for gene_function_id would/will be possible when we start using functional annotations that don't depend on NMDC-generated metagenomes? It certainly seems the structure you have here would make relaxing this dependency easier.

Just as an aside, the current thinking for replacing best_protein is to use razor_protein with parsimony such that:

1. Peptides that belong to exactly one protein have that protein as their razor_protein.
2. Peptides belonging to multiple proteins (i.e., non-unique), only one of which has a separate unique peptide, will associate with that protein as the razor_protein.
3. Non-unique peptides belonging to proteins with no other unique peptide will be associated with one or more proteins having a maximal count of other associated peptides, with these being the razor_protein entries.

We would then upload the entire list of razor_protein entries for a given WorkflowExecution to allow access via the API and aggregation as per @kheal's suggestion above. In the case where there isn't a match between the razor_protein and a GeneProduct from a metagenomics analysis, the WorkflowExecution would supply the list of 'gene_function_id' entries directly to the aggregation table, which is why I'm asking the above question.
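To illustrate those three rules on a toy example (all peptide and protein identifiers below are made up):

```yaml
# Hypothetical peptide -> razor_protein assignments under the rules above
- peptide: PEP_A                   # unique to ProteinX -> rule 1
  proteins: [ProteinX]
  razor_protein: ProteinX
- peptide: PEP_B                   # shared by X and Y, but only ProteinX has a
  proteins: [ProteinX, ProteinY]   # separate unique peptide (PEP_A) -> rule 2
  razor_protein: ProteinX
- peptide: PEP_C                   # shared by Y and Z, neither of which has a unique
  proteins: [ProteinY, ProteinZ]   # peptide; ProteinZ has more other peptides -> rule 3
  razor_protein: ProteinZ
```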

I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...

kheal commented 1 month ago

@SamuelPurvine

I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...

I don't think I quite understand this part. What records are you worried about duplicating? The FunctionalAnnotationAggMember tables when we upload new MetaproteomicsAnalysis records? Those are unique combinations of WorkflowExecution:GeneProduct, so I don't think we need to worry about duplication in that sense.

Or are you worried about current Mongo records being duplicated when we implement this? If so, @aclum and I have discussed removing all existing proteomics aggregation table records from Mongo and regenerating them with the cron job after all the schema and scripting for the aggregation tables have been updated.

SamuelPurvine commented 1 month ago

I mean specifically that there is a separate storage mechanism for gene_function_id values, is there not? If the metagenome-free workflow comes up with a gene_function_id that is already in that data chunk, we wouldn't want it to be added, only things not yet entered via the metagenome route. Maybe I'm just thinking further ahead than I should be :)

kheal commented 1 month ago

I see. I think we will have to deal with that (either in the metadata generation for the MetaproteomicsAnalysis, or in the aggregation scripts, or possibly both). But yes, I think that's a 'later us' problem that will become clearer once we have example data that we need to contort into the correct shape. It's encouraging to hear that you think the proposed FunctionalAnnotationAggMember class could work with those data, though.

SamuelPurvine commented 1 month ago

Yes, this proposal generalizes the `FunctionalAnnotationAggMember` class to allow for, *checks notes*, any `WorkflowExecution` to make entries into the functional annotation space.

Is anyone thinking of adding a description anywhere for the count slot? I would love to know explicitly what we are counting for this aggregation class...

aclum commented 1 month ago

This is sleek, I like it. WRT duplicates, I think we'd want to store both in the aggregation table, but then handle display with nmdc-server logic. It's potentially useful to know the Venn diagram of shared gene_function_id values between the reference-based and reference-free metap workflows.

SamuelPurvine commented 1 month ago

@aclum excellent point! Might we think of adding a/the source for a given functional annotation to more easily enable this, or just use the workflow "version", however we are going to denote such things? I suppose you could build a query that walks up the chain to the Biosample, walks back down to a/the related metagenome, pulls an example GeneProduct name, and compares it to the GeneProduct IDs from the WorkflowExecution in question... if there's more than one metagenome run then it's moar fun...

aclum commented 1 month ago

I'd assumed the reference-free workflow would have a different WorkflowExecution subclass name, which would allow for easy aggregations. Since the current results are only presence/absence, no work has been done on deduplication, e.g., ensuring that if reference-based annotation runs twice, only the results from the latest version are shown.