Open luizirber opened 4 years ago
Especially interesting: the unresolved/`rehydrate` use case in the examples:
Download a compact package, also known as an unresolved bag, containing data reports and file locations only for all 29 primate RefSeq genomes, then retrieve the data when needed using rehydrate:
# First download the compact package (<10 MB) (unresolved bag), containing data reports and file locations for 29 primate (Taxonomy ID: 9443) RefSeq genomes
$ ./datasets download assembly tax-id 9443 --refseq --limit ALL --unresolved --filename primates_refseq_unresolved.zip
# Then unzip the unresolved bag
$ unzip primates_refseq_unresolved.zip
# When needed, use the rehydrate command to get the genome sequences for these assemblies (about 80 GB of data)
$ ./datasets rehydrate --filename .
Found 563 files for rehydration
1h4m51s [====================================================================] 100%
which might fit neatly with the discussion in https://github.com/dib-lab/sourmash/issues/985#issuecomment-626215382
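To make the rehydrate step more concrete, here is a minimal Python sketch of reading a BagIt-style `fetch.txt`, the file a BagIt bag uses to list payload files that still have to be downloaded. It assumes the unresolved bag follows the plain BagIt layout (one `URL LENGTH PATH` entry per line, with `-` for an unknown length). The names `FetchEntry` and `parse_fetch_txt` and the example URLs are made up for illustration and are not part of the `datasets` tool.

```python
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    """One line of a BagIt fetch.txt (hypothetical helper type)."""
    url: str
    length: Optional[int]  # None when the bag records "-" (unknown size)
    path: str


def parse_fetch_txt(text: str) -> List[FetchEntry]:
    """Parse fetch.txt content into entries, skipping blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on whitespace into at most three fields: URL, length, path.
        url, length, path = line.split(None, 2)
        entries.append(FetchEntry(url, None if length == "-" else int(length), path))
    return entries


# Placeholder content, not taken from a real NCBI package.
sample = (
    "https://example.org/a.fna 1024 data/a.fna\n"
    "https://example.org/b.fna - data/b.fna\n"
)
entries = parse_fetch_txt(sample)
total_known = sum(e.length for e in entries if e.length is not None)
```

A rehydration tool would then download each `url` to its `path` and check the result against the bag's manifest; that part is left out here.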
I want to drop in a reference to frictionless data, too!
@taylorreiter mentioned the new `datasets` tool from NCBI on Slack, and the downloaded file is a zip file in the BDBag format. This is the content after running `./datasets download assembly GCA_003583405.1` and unzipping the file:

What would be needed to make sourmash databases into BDBag-compatible datasets?
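
As a starting point for that question, here is a rough sketch of the base layer only: BDBag builds on BagIt, and a minimal BagIt bag is a directory with a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest. The sketch uses just the Python standard library. The function names (`make_bag`, `verify_bag`), the bag name, and the `example.sig` payload are placeholders, and it does not cover BDBag-specific pieces such as profiles or remote file references.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write a minimal BagIt bag: bagit.txt, data/ payload, sha256 manifest.

    `payload` maps flat file names to bytes (hypothetical input shape).
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(None, 1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "sourmash_db_bag"
    # Placeholder bytes standing in for a real signature file.
    make_bag(bag, {"example.sig": b"placeholder signature bytes"})
    ok = verify_bag(bag)
```

In practice an existing BagIt/BDBag library would likely be a better fit than hand-rolling this; the point is only that the payload (signatures or index files) sits under `data/` and is described by the manifest.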