ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Possibility to add --released-before to datasets download virus genome/protein? #52

Closed mvdbeek closed 1 year ago

mvdbeek commented 3 years ago

This would let us get datasets as of a specific day. Otherwise it's tricky to write a deterministic test whose runtime doesn't grow as new datasets are added.

ghost commented 1 year ago

What would currently be the best way to obtain genomes/proteins released before a specific day without the --released-before option? Simply download more than you need and then filter?

ericcox1 commented 1 year ago

Hi snddns,

If you're interested in downloading virus genomes released before a specific date, you could try the following:

  1. Use datasets to get metadata for the virus taxonomic group that you're interested in
  2. Use jq to filter the metadata for genomes released before a particular date, and get the corresponding accessions
  3. Now use datasets again to download data for the list of accessions

Here's an example:

# Generate a table of monkeypox genomes released before 2005
datasets summary virus genome taxon monkeypox --as-json-lines | \
jq -r 'select(.release_date < "2005") | [.accession,.virus.organism_name,.release_date] | @tsv' > monkeypox-genomes.tsv

# Get the accession list from the table
cut -f1 monkeypox-genomes.tsv > accession.list

# Download the genomes 
datasets download virus genome accession --inputfile accession.list --filename monkeypox.zip
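
For the deterministic-test case raised at the top of this thread, it may help to pin the cutoff to a full date and apply it to the table itself. The following is only a sketch under stated assumptions: it assumes the third column holds ISO 8601 dates (as in the table generated above), which sort lexicographically, so a plain string comparison is enough. The helper name `filter_released_before`, the file `sample-genomes.tsv`, and the sample rows are all hypothetical and only there for illustration.

```shell
# Hypothetical helper: read a tab-separated table on stdin and print the
# accession (column 1) of every row whose release date (column 3) is
# strictly earlier than the cutoff given as the first argument.
# Assumes ISO 8601 dates such as 2004-12-31 or 2004-12-31T00:00:00Z.
filter_released_before() {
  cutoff="$1"
  # Appending "" forces awk to compare both values as strings.
  awk -F '\t' -v cutoff="$cutoff" '($3 "") < (cutoff "") { print $1 }'
}

# Made-up rows standing in for monkeypox-genomes.tsv (not real accessions)
printf 'ACC0001.1\tMonkeypox virus\t2001-12-03\nACC0002.1\tMonkeypox virus\t2004-12-31\nACC0003.1\tMonkeypox virus\t2005-01-01\n' > sample-genomes.tsv

# Keep only the accessions released before 1 January 2005
filter_released_before 2005-01-01 < sample-genomes.tsv > accession.list
cat accession.list
```

Because the comparison is strict, a record dated exactly on the cutoff is excluded. The resulting accession.list can then be passed to the download command with --inputfile, as in the example above.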

I hope that helps. Please let me know if you have any questions.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov