ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Rabies genome package includes invalid host taxon ids #385

Open joverlee521 opened 1 month ago

joverlee521 commented 1 month ago

Describe the bug

The rabies genome package includes invalid host taxon ids that cannot be fetched as a taxonomy package.

To Reproduce

  1. Download the rabies virus genome package with datasets
    datasets download virus genome taxon 11292
  2. Parse it to TSV format with dataformat
    dataformat tsv virus-genome --package ncbi_dataset.zip > ncbi_dataset.tsv
  3. Create a file with the unique host taxon ids
    tsv-select ncbi_dataset.tsv --fields 17 | tsv-filter --is-numeric 1 | tsv-uniq > ncbi_taxon_ids.txt
  4. Use the file to download a taxonomy package with datasets and see the warnings
    
    $ datasets download taxonomy taxon --inputfile ncbi_taxon_ids.txt
    The taxonomy ID '3044320' does not match any existing taxids for 'taxonomy'
    The taxonomy ID '1935980' does not match any existing taxids for 'taxonomy'
    The taxonomy ID '3041509' does not match any existing taxids for 'taxonomy'
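
To collect the rejected ids for closer inspection, the warning lines from step 4 can be parsed. This is only a minimal sketch: it assumes the warnings keep the exact wording shown in this report and have been saved to a file. The function name `extract_invalid_taxids` and the file name `warnings.txt` are hypothetical, and whether datasets writes these warnings to stdout or stderr is not confirmed here.

```shell
# Hypothetical helper: print the taxids named in "does not match" warnings,
# one per line, sorted and de-duplicated. Lines that are not warnings are ignored.
extract_invalid_taxids() {
    sed -n "s/.*The taxonomy ID '\([0-9][0-9]*\)' does not match any existing taxids.*/\1/p" "$1" | sort -u
}

# Example call, assuming the output of step 4 was captured in warnings.txt:
# extract_invalid_taxids warnings.txt > invalid_taxon_ids.txt
```

The resulting list could then be compared against ncbi_taxon_ids.txt to see which host ids from the genome package were rejected.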



Expected behavior

I'd expect all host taxon ids returned by NCBI Datasets to be valid taxonomy ids.

ericcox1 commented 1 month ago

Hi @joverlee521,

Thanks for opening this issue. I can confirm that this is a bug. We will investigate and post any updates to this thread.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets NIH/NLM/NCBI eric.cox@nih.gov