waldronlab / bugphyzz

Harmonized annotation of microbial physiology
http://waldronlab.io/bugphyzz/

Which value should be in Genome_ID and Accession_ID #133

Closed: sdgamboa closed this issue 2 years ago

sdgamboa commented 2 years ago

@lwaldron, should the values in the Genome_ID column come from a specific source (e.g., PATRIC), or can they come from different sources (both PATRIC and NCBI_ID)? Or should these IDs come from the SRA (table below)? How does this column differ from the Accession_ID column?

Currently, only the Genome_ID column in the "antimicrobial resistance" dataset is populated with PATRIC genome IDs (e.g., 1773.1063), and the Accession_ID column is populated only in the "isolation site" and "disease association" datasets. I think this is because these are the only datasets with sequenced strains.

The Accession_ID column contains a mix of NCBI identifiers from different databases (Genome, SRA, nucleotide) and different features (e.g., reads, genome, and accessory plasmids). Examples:

The current descriptions of these columns in the manuscript draft are:

| Column | Description |
| --- | --- |
| Genome ID | This ID number is from the NCBI that is associated with genomic reads, and can be used as an alternate ID number for each row. |
| Accession_ID | Unique sequence identifier in NCBI. |

suppressMessages({
  library(bugphyzz)
  library(purrr)
  phys <- physiologies()
})

map(phys, ~ head(unique(.x$Genome_ID)))
#> $`animal pathogen`
#> [1] NA
#> 
#> $`antimicrobial resistance`
#> [1] 287.4300 287.4396 287.4395 287.4317 287.4330 287.4402
#> 
#> $`antimicrobial sensitivity`
#> [1] NA
#> 
#> $`biofilm forming`
#> [1] NA
#> 
#> $`butyrate producing`
#> [1] NA
#> 
#> $`acetate producing`
#> [1] NA
#> 
#> $`lactate producing`
#> [1] NA
#> 
#> $arrangement
#> [1] "Unknown"
#> 
#> $shape
#> [1] NA
#> 
#> $`COGEM pathogenicity rating`
#> [1] NA
#> 
#> $`mutation rate per site per generation`
#> [1] NA
#> 
#> $`mutation rates per site per year`
#> [1] NA
#> 
#> $`extreme environment`
#> [1] NA
#> 
#> $`gram stain`
#> [1] "Unknown"
#> 
#> $`growth medium`
#> [1] NA
#> 
#> $`growth temperature`
#> [1] "Unknown"
#> 
#> $habitat
#> [1] "Unknown" ""       
#> 
#> $`optimal ph`
#> [1] NA
#> 
#> $aerophilicity
#> [1] NA
#> 
#> $`plant pathogenicity`
#> [1] NA
#> 
#> $width
#> [1] NA
#> 
#> $`spore shape`
#> [1] NA
#> 
#> $`isolation site`
#> [1] NA
#> 
#> $`disease association`
#> [1] NA
#> 
#> $`hydrogen gas producing`
#> [1] "Unknown"
#> 
#> $length
#> [1] "Unknown"
#> 
#> $`health associated`
#> [1] NA
map(phys, ~head(unique(.x$Accession_ID), 1))
#> $`animal pathogen`
#> [1] NA
#> 
#> $`antimicrobial resistance`
#> [1] NA
#> 
#> $`antimicrobial sensitivity`
#> [1] NA
#> 
#> $`biofilm forming`
#> [1] NA
#> 
#> $`butyrate producing`
#> [1] NA
#> 
#> $`acetate producing`
#> [1] NA
#> 
#> $`lactate producing`
#> [1] NA
#> 
#> $arrangement
#> [1] "Unknown"
#> 
#> $shape
#> [1] NA
#> 
#> $`COGEM pathogenicity rating`
#> [1] NA
#> 
#> $`mutation rate per site per generation`
#> [1] NA
#> 
#> $`mutation rates per site per year`
#> [1] NA
#> 
#> $`extreme environment`
#> [1] NA
#> 
#> $`gram stain`
#> [1] "Unknown"
#> 
#> $`growth medium`
#> [1] NA
#> 
#> $`growth temperature`
#> [1] "Unknown"
#> 
#> $habitat
#> [1] "Unknown"
#> 
#> $`optimal ph`
#> [1] NA
#> 
#> $aerophilicity
#> [1] NA
#> 
#> $`plant pathogenicity`
#> [1] NA
#> 
#> $width
#> [1] NA
#> 
#> $`spore shape`
#> [1] ""
#> 
#> $`isolation site`
#> [1] "NC_002937, NC_005863"
#> 
#> $`disease association`
#> [1] "NC_010399, NC_010407, NC_010408"
#> 
#> $`hydrogen gas producing`
#> [1] "Unknown"
#> 
#> $length
#> [1] "Unknown"
#> 
#> $`health associated`
#> [1] NA

Created on 2021-11-04 by the reprex package (v2.0.1)

kbeckenrode commented 2 years ago

Hi Samuel,

I don't know if this has been resolved yet, but some species do not have an NCBI ID because of how they were isolated (e.g., a genome isolated from soil). Therefore, we only have a genome ID. PATRIC only uses genome IDs, and some papers only use accession IDs. :/ Not ideal, and I'm not sure how to get around that.

kelly

sdgamboa commented 2 years ago

Hello, Kelly. The current solution would be to use the genome_id column only for IDs coming from the genome database of the NCBI (GCF/GCA) and the accession_id column for IDs from the nucleotide database (nuccore). There would be additional columns, e.g., PATRIC_ID, MiDAS_ID, SRA_ID, etc. All these columns will end in *_ID and be documented in a vignette. If an entry doesn't have any data for a column, we would just fill it with NA. Hope this makes sense.
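
To make the proposed split concrete, here is a minimal sketch in base R of how identifiers could be routed to the matching *_ID column by their format. The helper `classify_id()` is hypothetical and not part of bugphyzz. The regular expressions are simplifying assumptions: the nucleotide pattern only covers RefSeq-style accessions with an underscore (such as NC_002937), and `GCF_000000000.1` and `SRR0000000` are made-up placeholders that only illustrate the format.

```r
# Hypothetical helper (not part of bugphyzz): guess the target *_ID column
# for each identifier from its format. Unrecognized or missing values get NA.
classify_id <- function(x) {
  out <- rep(NA_character_, length(x))
  # NCBI genome assemblies: GCF_/GCA_ + 9 digits + optional version
  out[grepl("^GC[AF]_[0-9]{9}(\\.[0-9]+)?$", x)] <- "Genome_ID"
  # RefSeq-style nucleotide accessions, e.g. NC_002937 (assumed pattern;
  # GenBank accessions without an underscore are not covered)
  out[grepl("^[A-Z]{2}_[0-9]+(\\.[0-9]+)?$", x)] <- "Accession_ID"
  # PATRIC genome IDs: <taxid>.<number>, e.g. 1773.1063
  out[grepl("^[0-9]+\\.[0-9]+$", x)] <- "PATRIC_ID"
  # SRA runs, experiments, samples, and projects (SRR, ERX, DRS, ...)
  out[grepl("^[SED]R[RXSP][0-9]+$", x)] <- "SRA_ID"
  out
}

# Placeholder and source-derived IDs, plus a missing value
ids <- c("GCF_000000000.1", "NC_002937", "1773.1063", "SRR0000000", NA)
classify_id(ids)

# Entries holding several comma-separated IDs can be split first
parts <- strsplit("NC_002937, NC_005863", ",\\s*")[[1]]
classify_id(parts)
```

Anything the patterns do not recognize stays NA, which matches the plan to fill columns with NA when an entry has no data for them.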

Sorry I didn't post the above comment in this issue before. I think we could close this issue now?

kbeckenrode commented 2 years ago

That makes total sense. We can close the issue.