microbiomedata / issues

public repo for issues related to NMDC work

Fix existing `MetaproteomicsAnalysis` records to have compliant ranges for protein slots #897

Open kheal opened 2 weeks ago

kheal commented 2 weeks ago

Current Behavior

Some MetaproteomicsAnalysis records have non-compliant values in the best_protein and all_proteins slots. The range of these slots is GeneProduct, whose id slot has a range of uriorcurie. See this file for the full list of non-uriorcurie values.

Expected Behavior

NMDC uriorcurie slots should be populated with a CURIE: a prefix, a colon, and a local identifier, like nmdc:wfmgan-11-pmh0a992.1_0000691_21398_23068.
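As a rough illustration of that shape, the sketch below checks whether a value looks like prefix:local_id in base R. The helper name `is_curie_like` and the regular expression are assumptions made for this example; the pattern is a simplified approximation and not the schema's actual uriorcurie validation.

```r
# Hypothetical helper: TRUE when a value has the form prefix:local_id.
# The pattern is a simplified approximation of a CURIE, not the full specification.
is_curie_like <- function(x) {
  grepl("^[A-Za-z_][A-Za-z0-9_.-]*:[^[:space:]]+$", x)
}

# One compliant id from this issue and one of the non-compliant values
checks <- is_curie_like(c("nmdc:wfmgan-11-pmh0a992.1_0000691_21398_23068",
                          "Contaminant_TRYP_PIG"))
print(checks)
```

The script under Steps To Reproduce uses the narrower test of an nmdc: prefix, which is stricter than this general shape check.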

Steps To Reproduce

See below for R script to find these non-compliant values in mongo.

# Load essential libraries
library(jsonlite)
library(tidyverse)

# Pull all the MetaP data in production mongo
og_url <- 'https://api.microbiomedata.org/nmdcschema/metaproteomics_analysis_activity_set?max_page_size=100'
response <- jsonlite::fromJSON(URLencode(og_url, repeated = TRUE))
ids <- response$resources$id

# Check that there are 52 unique ids
if (length(unique(ids)) != 52){
  print('We are missing ids!')
}

# Pull out the in-lined peptide quantification fields
pep_quans <- response$resources$has_peptide_quantifications

# Loop through each of the peptide quantifications and pull out the protein ids
protein_ids <- c()
for (i in seq_along(pep_quans)){
  protein_ids <- c(unique(unlist(pep_quans[[i]]$all_proteins)), protein_ids)
}
protein_ids <- unique(protein_ids)

# Keep only the protein ids that do not start with nmdc:
non_compliant_ids <- as.data.frame(protein_ids) %>%
    filter(!str_detect(protein_ids, '^nmdc:'))

# Save the non-compliant protein ids
write.table(non_compliant_ids, file = 'non_compliant_protein_ids.txt', row.names = FALSE, col.names = FALSE)

Notes

Closing this issue will unblock https://github.com/microbiomedata/nmdc-schema/issues/2028

kheal commented 1 week ago

The proteomics team has decided to remove these contaminants from the existing Mongo records and will fix the source files for the UniProt mapping for future runs of the workflow.
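A minimal sketch of what that removal could look like, run against an in-memory list rather than Mongo itself. Everything here is an assumption for illustration: the function name `strip_contaminants`, the `^Contaminant_` pattern, the example ids, and in particular the choice to drop a whole peptide quantification when its best_protein is a contaminant. The actual cleanup rules are for the team to decide.

```r
# Hypothetical helper: remove contaminant protein ids from peptide quantifications.
strip_contaminants <- function(pep_quans, pattern = "^Contaminant_") {
  # Drop contaminant ids from each all_proteins vector
  cleaned <- lapply(pep_quans, function(pq) {
    pq$all_proteins <- pq$all_proteins[!grepl(pattern, pq$all_proteins)]
    pq
  })
  # Assumption: an entry whose best_protein is a contaminant is dropped entirely
  Filter(function(pq) !isTRUE(grepl(pattern, pq$best_protein)), cleaned)
}

# Made-up records for illustration; the nmdc: id is a placeholder, not real data
records <- list(
  list(best_protein = "nmdc:example-protein-1",
       all_proteins = c("nmdc:example-protein-1", "Contaminant_TRYP_PIG")),
  list(best_protein = "Contaminant_ALBU_HUMAN",
       all_proteins = c("Contaminant_ALBU_HUMAN"))
)

cleaned <- strip_contaminants(records)
str(cleaned)
```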

SamuelPurvine commented 1 week ago

Entries of interest are:
Contaminant_TRYP_PIG
Contaminant_Trypa1
Contaminant_Trypa2
Contaminant_Trypa3
Contaminant_Trypa4
Contaminant_Trypa5
Contaminant_Trypa6
Contaminant_TRYP_BOVIN
Contaminant_CTRA_BOVIN
Contaminant_CTRB_BOVIN
Contaminant_ALBU_HUMAN
Contaminant_ALBU_BOVIN
Contaminant_K2C1_HUMAN
Contaminant_K22E_HUMAN
Contaminant_K1C9_HUMAN
Contaminant_K1C10_HUMAN
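One way to confirm that this list accounts for every non-compliant value is a set difference, sketched below in base R. The `non_compliant` vector is a made-up stand-in for the ids collected by the script in the issue description, and `hypothetical_other_id` is invented purely to show what an uncovered value would look like.

```r
# Contaminant entries listed in this comment
contaminants <- c(
  "Contaminant_TRYP_PIG", "Contaminant_Trypa1", "Contaminant_Trypa2",
  "Contaminant_Trypa3", "Contaminant_Trypa4", "Contaminant_Trypa5",
  "Contaminant_Trypa6", "Contaminant_TRYP_BOVIN", "Contaminant_CTRA_BOVIN",
  "Contaminant_CTRB_BOVIN", "Contaminant_ALBU_HUMAN", "Contaminant_ALBU_BOVIN",
  "Contaminant_K2C1_HUMAN", "Contaminant_K22E_HUMAN", "Contaminant_K1C9_HUMAN",
  "Contaminant_K1C10_HUMAN"
)

# Stand-in for the real non-compliant ids (assumption, not production data)
non_compliant <- c("Contaminant_TRYP_PIG", "Contaminant_K1C10_HUMAN",
                   "hypothetical_other_id")

# Any value left here is non-compliant but not covered by the contaminant list
unexplained <- setdiff(non_compliant, contaminants)
print(unexplained)
```

An empty result on the real ids would suggest that removing the contaminants is enough to make the existing records compliant.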