theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[Assembly_fetch] Add support for downloading from NCBI Virus database #607

Open kapsakcj opened 2 months ago

kapsakcj commented 2 months ago

:cool:

:pushpin: Explain the Request

Some GenBank accessions cannot be downloaded with the command currently used in the Assembly_fetch workflow:

datasets download genome accession ~{ncbi_accession}

In code here: https://github.com/theiagen/public_health_bioinformatics/blob/5be343354f716d77e9e4a0fb4a2ec10eb3bc00a5/tasks/utilities/data_import/task_ncbi_datasets.wdl#L27C5-L28C24

For example, the accession OM900516.2 fails with this message:

$ datasets download genome accession OM900516.2  --filename OM900516.zip  --assembly-version latest   --include genome
Error: invalid or unsupported assembly accession: OM900516

Use datasets download genome accession <command> --help for detailed help about a command.

This is because these kinds of accessions are only accessible through the NCBI Virus data package, so a different sub-command is needed to download the genome (and other associated files).

This command works:

$ datasets download virus genome accession OM900516.2  --filename OM900516.2.zip --include genome
New version of client (16.27.2) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Downloading: OM900516.2.zip    15.1kB valid zip structure -- files not checked
Validating package [================================================] 100% 5/5
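
As a rough sketch of how the two commands above could be combined, the task could try the genome data package first and fall back to the virus data package when that fails. The function name `fetch_assembly`, the output file naming, and the printed "genome"/"virus" label are assumptions made for illustration; this is not what the dev branch currently does.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_assembly is a hypothetical helper, not existing workflow code.
# It prints which data package ("genome" or "virus") served the accession.
fetch_assembly() {
  local accession="$1"
  local zipfile="${accession}.zip"

  # First attempt: the genome data package, as in the current task.
  # Any stdout from the client is sent to stderr so that the function's
  # own stdout carries only the label.
  if datasets download genome accession "${accession}" \
       --filename "${zipfile}" --include genome >&2; then
    echo "genome"
    return 0
  fi

  # Fallback: the virus data package, which worked for OM900516.2 above.
  if datasets download virus genome accession "${accession}" \
       --filename "${zipfile}" --include genome >&2; then
    echo "virus"
    return 0
  fi

  echo "could not download ${accession} from either data package" >&2
  return 1
}

# Example call (requires the datasets client on PATH):
#   source_db="$(fetch_assembly OM900516.2)"
```

Trying the genome package first keeps the behavior unchanged for accessions that already work; a cheaper alternative might be to let the user pick the data package through a workflow input instead of relying on a failed first attempt.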

I started a dev branch called cjk-assembly-fetch for this a long time ago, but it fell by the wayside as other, higher priorities arose.

It would be good to continue making commits to this branch and add more complete support. Things that need to be done:

kapsakcj commented 2 months ago

@emily-smith1 had success with the dev branch as it stands today: https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/dataAnalysis_SARS-CoV-2_CA-CDC/job_history/6cd834cc-aa53-4d40-8a27-4554edcbae7b

I'm glad we didn't delete this branch! pats self on back 😄