microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Rename `has_metabolite_quantification` and associated class to `has_metabolite_identification` #1914

Closed kheal closed 3 months ago

kheal commented 3 months ago

The `has_metabolite_quantification` slot on `MetabolomicsAnalysis` should be renamed to `has_metabolite_identification`.

The `MetaboliteQuantification` class should be renamed to `MetaboliteIdentification`.

The `metabolite_quantified` slot on the `MetaboliteQuantification` class should be renamed to `metabolite_identified`.

This will require a migration
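For illustration only, a minimal sketch of what such a migration could look like on a single record held as a Python dict. The record shape, the `type` field, and the `nmdc:` prefixed type values are assumptions made for this example and are not taken from the schema or the actual migrator; only the old and new slot and class names come from the list above.

```python
# Old -> new slot names, taken from the rename list in this issue.
SLOT_RENAMES = {
    "has_metabolite_quantification": "has_metabolite_identification",
    "metabolite_quantified": "metabolite_identified",
}

# ASSUMPTION: records carry a "type" field with a prefixed class name.
TYPE_RENAMES = {
    "nmdc:MetaboliteQuantification": "nmdc:MetaboliteIdentification",
}


def migrate(node):
    """Return a renamed copy of `node`; the input is left unmodified."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            new_key = SLOT_RENAMES.get(key, key)
            if key == "type" and isinstance(value, str):
                # Rewrite the class designator if it names the old class.
                out[new_key] = TYPE_RENAMES.get(value, value)
            else:
                # Recurse so nested objects and lists are renamed too.
                out[new_key] = migrate(value)
        return out
    if isinstance(node, list):
        return [migrate(item) for item in node]
    return node


# Hypothetical record, for demonstration only.
old_record = {
    "id": "nmdc:example-1",
    "has_metabolite_quantification": [
        {
            "type": "nmdc:MetaboliteQuantification",
            "metabolite_quantified": "chebi:example",
        }
    ],
}
new_record = migrate(old_record)
```

Building a new dict rather than renaming keys in place keeps the original record available for comparison or rollback.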

ssarrafan commented 3 months ago

Who is this assigned to? @kheal

kheal commented 3 months ago

We haven't assigned this yet, it should not be on the current sprint board.

brynnz22 commented 3 months ago

For documentation purposes: we are changing these class and slot names because they misrepresent what is happening. Metabolites are not being quantified, only identified.

turbomam commented 3 months ago

PS are we really planning on putting records for metabolite identification in MongoDB? We don't do anything like that for genomic results. The data volume could be huge.

@SamuelPurvine and I have been talking about saving proteomics results somewhere outside of MongoDB, or at least removing some level of detail from the records, like the qualified lists of all possible peptide identifications.

SamuelPurvine commented 3 months ago

> @SamuelPurvine and I have been talking about saving proteomics results somewhere outside of MongoDB, or at least removing some level of detail from the records, like the qualified lists of all possible peptide identifications.

More directly, we already save the proteomics results outside of MongoDB, in the Peptide_Report and Protein_Report TSV files that are data objects produced by the workflow. We originally planned to put proteomics results into Mongo because the thinking was that some to-be-developed NMDC aggregation tools would pull those results from the DB on the fly, to allow the user to "do some cool stuff in the portal".

A/the plan going forward is to pare the Analysis_activity results we report/load into MongoDB as JSON down to just the BestProteins identified from a workflow run/instance. That allows a mild refactoring of the aggregation table (removing the best_protein boolean, which really ought to be is_best_protein to denote a question being answered) and an overhaul of the aggregation code to simply group the functional annotations associated with the BestProteins for a given workflow instance and count the number of BestProteins per functional annotation. This can also help drop the PeptideQuantification and ProteinQuantification classes and associated slots (who DOESN'T like dropping classes??).
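As a rough sketch of the "group and count" aggregation described above, assuming the BestProteins for one workflow instance arrive as a list of dicts. The keys `protein_id` and `functional_annotations` and the example values are hypothetical, chosen for illustration, and do not reflect the actual aggregation code or table layout.

```python
from collections import Counter


def count_best_proteins_per_annotation(best_proteins):
    """Count how many BestProteins carry each functional annotation.

    `best_proteins` is assumed to hold the records for one workflow instance.
    """
    counts = Counter()
    for protein in best_proteins:
        # set() so a protein listing the same annotation twice counts once.
        for annotation in set(protein.get("functional_annotations", [])):
            counts[annotation] += 1
    return dict(counts)


# Hypothetical BestProteins for a single workflow instance.
example_best_proteins = [
    {"protein_id": "p1", "functional_annotations": ["KEGG.ORTHOLOGY:K00001"]},
    {
        "protein_id": "p2",
        "functional_annotations": [
            "KEGG.ORTHOLOGY:K00001",
            "KEGG.ORTHOLOGY:K00002",
        ],
    },
    {"protein_id": "p3", "functional_annotations": []},
]
example_counts = count_best_proteins_per_annotation(example_best_proteins)
```

Because only BestProteins are loaded, no `best_protein` / `is_best_protein` flag is needed to filter the input before counting.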