sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Make sourmash databases BDBag-compatible? #991

Open luizirber opened 4 years ago

luizirber commented 4 years ago

@taylorreiter mentioned NCBI's new datasets tool on Slack; the file it downloads is a zip archive in the BDBag format. This is the content after running ./datasets download assembly GCA_003583405.1 and unzipping the file:

.
├── bag-info.txt
├── bagit.txt
├── data
│   ├── dataset_catalog.json
│   └── GCA_003583405.1
│       ├── data_report.yaml
│       └── GCA_003583405.1_CHULA_Jazt_1.1_for_version_1.1_of_the_Jishengella_sp._nov._AZ1-13_genome_from_a_lab_in_CHULA_genomic.fna
├── fetch.txt
├── manifest-md5.txt
└── tagmanifest-md5.txt

What would be needed to make sourmash databases into BDBag-compatible datasets?
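
As a rough starting point: going by the layout above, a bag needs at least a bagit.txt declaration, a data/ payload directory, and a payload manifest of checksums. Here is a minimal sketch using only the Python standard library. `make_bag` and `md5sum` are hypothetical helper names, not sourmash or bdbag API; the sketch assumes the database files were already copied under data/ and that MD5 manifests (as in the NCBI example) are acceptable. It skips bag-info.txt, tagmanifest-md5.txt, and fetch.txt.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir):
    """Write bagit.txt and manifest-md5.txt for files already under bag_dir/data.

    Hypothetical helper for illustration only.
    Returns the manifest lines that were written.
    """
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"

    # Bag declaration: BagIt version and tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )

    # Payload manifest: one "<checksum>  <path relative to the bag root>" per file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(bag_dir).as_posix()
        lines.append(f"{md5sum(path)}  {rel}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines))
    return lines
```

In practice the bdbag or bagit tooling would probably be a better fit than hand-rolling this; the sketch is only meant to show how little structure sits on top of the existing database files.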

luizirber commented 4 years ago

Especially interesting: the unresolved/rehydrate use case in the examples:

Download a compact package, also known as an unresolved bag, containing data reports and file locations only for all 29 primate RefSeq genomes, then retrieve the data when needed using rehydrate:

# First download the compact package (<10 MB) (unresolved bag), containing data reports and file locations for 29 primate (Taxonomy ID: 9443) RefSeq genomes
$ ./datasets download assembly tax-id 9443 --refseq --limit ALL --unresolved --filename primates_refseq_unresolved.zip

# Then unzip the unresolved bag
$ unzip primates_refseq_unresolved.zip

# When needed, use the rehydrate command to get the genome sequences for these assemblies (about 80 GB of data)
$ ./datasets rehydrate --filename .
Found 563 files for rehydration
1h4m51s [====================================================================] 100%

This might fit neatly with the discussion in https://github.com/dib-lab/sourmash/issues/985#issuecomment-626215382
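
To make the unresolved/rehydrate idea a bit more concrete, here is a small sketch of how a client could work out which payload files still need to be fetched. It assumes fetch.txt follows the BagIt layout of one `url length filepath` entry per line, with `-` allowed for an unknown length; I have not checked the exact file that datasets writes. `parse_fetch` and `missing_files` are hypothetical names, and nothing here downloads anything.

```python
from pathlib import Path


def parse_fetch(text):
    """Parse BagIt-style fetch.txt content into (url, length, path) tuples.

    length is an int, or None when the file lists "-" (unknown size).
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # The path is the third field and may itself contain spaces.
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append((url, size, path))
    return entries


def missing_files(bag_dir):
    """Return the fetch.txt entries whose payload file is not on disk yet."""
    bag_dir = Path(bag_dir)
    fetch = bag_dir / "fetch.txt"
    if not fetch.exists():
        return []  # fully resolved bag, nothing to rehydrate
    return [
        entry
        for entry in parse_fetch(fetch.read_text())
        if not (bag_dir / entry[2]).exists()
    ]
```

A sourmash database shipped this way could then stay small on disk until the listed files are actually needed.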

ctb commented 3 years ago

I want to drop in a reference to frictionless data, too!