sdgamboa closed this issue 2 years ago
Hi Samuel,
I don't know if this has been resolved yet, but some species do not have an NCBI ID because of how they were isolated (e.g. a genome isolated from soil). Therefore, we only have a genome ID. PATRIC only uses genome IDs, and some papers only use accession IDs. :/ Not ideal, and I'm not sure how to get around that.
kelly
Hello, Kelly. The current solution would be to use the genome_id column only for IDs from the NCBI genome database (GCF/GCA) and the accession_id column for IDs from the nucleotide database (nuccore). There would be additional columns, e.g. PATRIC_ID, MiDAS_ID, SRA_ID, etc. All these columns will end in *_ID and be documented in a vignette. If an entry doesn't have any data for a column, we would just fill it with NA. Hope this makes sense.
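For illustration only, here is a rough sketch of how identifiers could be routed into those columns. It is written in Python just to keep it self-contained and is not code from this project. The column names come from the proposal above; the function name `route_identifier`, the regular expressions, and the fallback rule (anything unrecognized goes to accession_id) are assumptions, so real data would need stricter checks.

```python
import re

# Columns taken from the proposal above; None stands in for NA.
COLUMNS = ["genome_id", "accession_id", "PATRIC_ID", "SRA_ID"]

# Assumed patterns, checked in order. These are simplifications.
PATTERNS = [
    # NCBI genome (assembly) accessions, e.g. GCF_000005845.2
    ("genome_id", re.compile(r"^GC[AF]_\d{9}\.\d+$")),
    # SRA run/experiment/sample/study accessions, e.g. SRR1234567
    ("SRA_ID", re.compile(r"^[SED]R[RXSP]\d+$")),
    # PATRIC genome IDs of the form <taxid>.<number>, e.g. 1773.1063
    ("PATRIC_ID", re.compile(r"^\d+\.\d+$")),
]


def route_identifier(identifier):
    """Return one row (dict) with the identifier placed in a single column."""
    row = dict.fromkeys(COLUMNS)  # every column starts as None (NA)
    value = identifier.strip()
    if not value:
        return row  # nothing to assign; all columns stay NA
    for column, pattern in PATTERNS:
        if pattern.match(value):
            row[column] = value
            return row
    # Assumption: anything else is treated as a nucleotide (nuccore) accession.
    row["accession_id"] = value
    return row
```

With these assumed patterns, `route_identifier("1773.1063")` fills only PATRIC_ID and leaves the other columns as NA, while something like `NC_000913.3` falls through to accession_id.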
Sorry I didn't post the above comment in this issue before. I think we could close this issue now?
That makes total sense. We can close the issue.
@lwaldron, should the values in the Genome_ID column come from a specific source (e.g., PATRIC), or can they come from different sources (both PATRIC and NCBI_ID)? Or should these IDs come from the SRA (table below)? How does this column differ from the Accession_ID column?
Currently, only the Genome_ID column in the "antimicrobial resistance" dataset is populated with PATRIC genome IDs (e.g., 1773.1063), and only the Accession_ID column of the "isolation site" and "disease association" datasets is populated with IDs. I think this is because these are the only datasets with sequenced strains.
The Accession_ID column contains a mix of NCBI identifiers from different databases (Genome, SRA, nucleotide) and different features (e.g., reads, genome, and accessory plasmids). Examples:
Created on 2021-11-04 by the reprex package (v2.0.1)