review id_locus pattern

aclum commented 4 weeks ago

To do when reviewing proteomics collections. Currently there is no enforced pattern for best_protein or all_proteins.

Note: for a subset of annotation files the gene IDs use nmdc:wfmgas instead of nmdc:wfmgan

cc @picowatt @SamuelPurvine @turbomam

SamuelPurvine commented 4 weeks ago

Please keep in mind we are also intending to add annotations that aren't NMDC derived when using the version 2 pipeline. Currently we are planning to use UniProt annotations, but we shouldn't expect to limit ourselves to only that repository. Given the wild west of protein naming I've encountered over the past 20 years, trying to enforce a naming structure or pattern will end in misery, but we can always add that next standard (xkcd standards comic reference) :)

all_proteins is going to firmly and happily go away in the schema as soon as we can tackle the bloat, which comes after implementation of the refactored schema, which comes after re-id-ing the proteomics data, which comes after the metagenome annotations have completed...

There's also discussion to re-name best_protein to something that is more descriptive, such as most_confidently_associated_protein or similar ilk, and add a slot that defines and describes how that association was made (currently parsimony, others will likely come online).

aclum commented 4 weeks ago

At the metap meeting on 6/4/24 we discussed a minimum constraint of a CURIE and a maximum constraint of an NMDC workflow identifier + UniProt.
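A rough sketch of what those two constraint levels could look like as checks, for discussion only. Both regexes and both function names are hypothetical and not taken from the schema: the loose pattern only requires a prefix, a colon, and a non-empty local part, and the strict pattern assumes NMDC workflow IDs start with nmdc:wf (as in nmdc:wfmgan / nmdc:wfmgas) and that UniProt IDs use a UniProtKB: prefix.

```python
import re

# Hypothetical "minimum" constraint: anything shaped like prefix:local_id.
LOOSE_CURIE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*:\S+")

# Hypothetical "maximum" constraint: only NMDC workflow-derived IDs or UniProtKB IDs.
# The exact shape of the local part is deliberately left open (\S+).
STRICT_PROTEIN_ID = re.compile(r"(nmdc:wf\S+|UniProtKB:\S+)")


def is_curie(value: str) -> bool:
    """Return True if value looks like a CURIE (prefix, colon, local id)."""
    return LOOSE_CURIE.fullmatch(value) is not None


def is_allowed_protein_id(value: str) -> bool:
    """Return True if value uses one of the two assumed allowed prefixes."""
    return STRICT_PROTEIN_ID.fullmatch(value) is not None
```

Under these assumptions, a raw FASTA-style string such as "sp|A1L190|SYCE3_HUMAN" fails even the loose check because it has no prefix and colon.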

SamuelPurvine commented 4 weeks ago

Current plan for UniProt IDs (incorporated in the current Kaiko implementation) would be to use their full string, such as "tr|A0A1D5Q1C9|A0A1D5Q1C9_MACMU" or "sp|A1L190|SYCE3_HUMAN", which denotes three elements: the sequence source (TrEMBL or Swiss-Prot), the Entry ID, and the Entry Name, each separated by a pipe. One supposes adding a prefix of "uniprot:" might curie these adequately?

aclum commented 3 weeks ago

@SamuelPurvine The existing documentation about UniProt prefix registration is at https://bioregistry.io/registry/uniprot. NMDC has the prefix UniProtKB, which expands to https://bioregistry.io/uniprot. So for "sp|A1L190|SYCE3_HUMAN", the code that makes the JSON file for the schema would write the value for best_protein/occam_protein/preferred_slot_name as UniProtKB:<Entry ID>, for example UniProtKB:A1L190.
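As a minimal sketch of that conversion (not the workflow's actual code), the pipe-separated string can be split and the middle element used as the local ID. The function name uniprot_header_to_curie is hypothetical, and the check that the first element is "sp" or "tr" is an assumption based on the two examples in this thread.

```python
def uniprot_header_to_curie(header: str, prefix: str = "UniProtKB") -> str:
    """Convert a 'source|EntryID|EntryName' string to a 'prefix:EntryID' CURIE."""
    parts = header.strip().split("|")
    # Expect exactly three pipe-separated elements: source, Entry ID, Entry Name.
    if len(parts) != 3:
        raise ValueError(f"expected 3 pipe-separated fields, got {len(parts)}: {header!r}")
    source, entry_id, _entry_name = parts
    # Assumption: only Swiss-Prot (sp) and TrEMBL (tr) sources appear.
    if source not in {"sp", "tr"}:
        raise ValueError(f"unexpected sequence source {source!r} in {header!r}")
    if not entry_id:
        raise ValueError(f"empty Entry ID in {header!r}")
    return f"{prefix}:{entry_id}"


print(uniprot_header_to_curie("sp|A1L190|SYCE3_HUMAN"))  # UniProtKB:A1L190
```

Note that this drops the sequence source and Entry Name, so if those are still wanted downstream they would need to be kept somewhere else.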

SamuelPurvine commented 3 weeks ago

OK, very cool, we should certainly be able to accommodate that. Is there / will there be machinery to apply functional annotations for UniProt entries, or do "we" (the Kaiko team as implemented through the proteomics workflow) need to provide that to allow the portal to show Kaiko search results? There's probably more packed into that question (like who will end up making that aggregation table and populating it... and how?... and when?) than easily fits here, but thought I'd ask!

aclum commented 3 weeks ago

The only functional annotation supported now is KEGG, so this would require development in the data portal. The workflow should not generate the aggregation; if needed, the aggregation would be written separately. If Cam is up for maintaining it, that would be good, because he'd have knowledge of the workflow itself. The aggregation code now has its own repository: https://github.com/microbiomedata/nmdc-aggregator

microbiomedata / nmdc-schema

review id_locus pattern #2028