ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

--geo-location flag runs into invalid zip archive error #326

Open joverlee521 opened 7 months ago

joverlee521 commented 7 months ago

Hi NCBI Datasets team,

Today I tried a couple of geo-locations with the --geo-location flag and ran into the invalid zip archive error every time.

My attempt with state level "WA"
```
$ ./datasets download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
2024/02/28 19:03:50 GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

2024/02/28 19:03:51 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:03:51 POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 19:03:51 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:03:51 POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 189
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"WA","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 19:03:51 HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: data/ncbi_dataset.zip 112kB done
Downloading: data/ncbi_dataset.zip 112kB invalid zip archive
Validating package []
Use datasets download virus genome taxon --help for detailed help about a command.
```
My attempt with country level "USA"
```
$ ./datasets download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
2024/02/28 18:50:55 GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

2024/02/28 18:50:56 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 18:50:56 POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 18:50:56 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 18:50:56 POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 190
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"USA","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 18:50:56 HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: data/ncbi_dataset.zip 8.25MB done
Downloading: data/ncbi_dataset.zip 8.25MB invalid zip archive
Validating package []
Use datasets download virus genome taxon --help for detailed help about a command.
```
My attempt with continent level "Africa"
```
$ ./datasets download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
2024/02/28 19:02:41 GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

2024/02/28 19:02:42 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:02:42 POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 19:02:42 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:02:42 POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 193
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"Africa","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 19:02:42 HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: data/ncbi_dataset.zip 855B invalid zip archive
Downloading: data/ncbi_dataset.zip 855B invalid zip archive
Validating package []
Use datasets download virus genome taxon --help for detailed help about a command.
```
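
For anyone debugging a similar failure, one way to see what the server actually returned is to check whether the downloaded file starts with the ZIP local-file-header signature (the bytes `PK\003\004`) and, if not, to print its first bytes. The sketch below is only a diagnostic aid; the function names `looks_like_zip` and `inspect_download` are made up for illustration and are not part of the datasets CLI. Note that a completely empty ZIP archive starts with a different signature and would be reported as "not a zip" here.

```shell
# Succeeds if the file starts with the ZIP local-file-header magic bytes 50 4b 03 04.
looks_like_zip() {
  [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "504b0304" ]
}

# Hypothetical helper: report the signature check and, on failure, show the
# first 200 bytes, which may contain a readable error message.
inspect_download() {
  if looks_like_zip "$1"; then
    echo "zip signature found: $1"
  else
    echo "no zip signature: $1"
    head -c 200 "$1"
    echo
  fi
}
```

It could be called as `inspect_download data/ncbi_dataset.zip` right after a failed download.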
ericcox1 commented 7 months ago

Hi @joverlee521,

Thanks for opening this issue. We are aware of this bug but haven't yet scheduled a time to get it fixed.

As a workaround, you can download the cached SARS-CoV-2 genome data package (a highly compressed archive containing all SARS-CoV-2 sequences), use grep to identify the sequences with the geographic location of interest, and then extract those sequences with samtools.

When I last looked at this about a year ago, the grep command below seemed to work well for narrowing down the list of genomes to those isolated from Washington state, but you may want to verify that this is still working well for you.

Here is what I suggest:

# Download all SARS-CoV-2 genomes
datasets download virus genome taxon sars-cov-2 --filename sars2.zip

# Extract all SARS-CoV-2 genome sequences from the downloaded zip archive
unzip -qc sars2.zip ncbi_dataset/data/genomic.fna > sars2-genomic.fna

# From the downloaded zip archive, use dataformat to generate a table of genome accessions and geo-location and filter for genomes from Washington state
dataformat tsv virus-genome --package sars2.zip --fields accession,geo-location | \
grep "USA: WA\|USA: Washington\|USA:.*[, ]WA$" | \
grep -v "ID\|Idaho\|DC\|DISTRICT OF COLUMBIA" > sars2-WA-clean.tsv

# Copy the accessions to a new file
cut -f1 sars2-WA-clean.tsv > sars2-WA-clean-acc.list

# Use samtools to copy the Washington state genomes from the file containing all SARS-CoV-2 genomes to a new file
samtools faidx --region-file sars2-WA-clean-acc.list --output sars2_WA_genomes.fna sars2-genomic.fna 
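
To make the behavior of the two grep filters above easier to check, here is a small sketch that runs the same patterns over a few made-up rows. The accessions and locations in `sample_rows` are invented for illustration and are not real records, and the function names are hypothetical. The `\|` alternation assumes GNU grep.

```shell
# Same two grep filters as in the workflow above, wrapped in a function.
filter_wa() {
  grep "USA: WA\|USA: Washington\|USA:.*[, ]WA$" |
    grep -v "ID\|Idaho\|DC\|DISTRICT OF COLUMBIA"
}

# Made-up sample rows (accession<TAB>geo-location); not real records.
sample_rows() {
  printf 'ACC0001.1\tUSA: WA\n'
  printf 'ACC0002.1\tUSA: Washington, King County\n'
  printf 'ACC0003.1\tUSA: Washington, DC\n'
  printf 'ACC0004.1\tUSA: California\n'
  printf 'ACC0005.1\tUSA: Seattle, WA\n'
}

# Keeps ACC0001.1, ACC0002.1 and ACC0005.1; drops the DC and California rows.
sample_rows | filter_wa
```

One thing to keep in mind: `grep -v` is applied to the whole line, so a row is also dropped if "ID" or "DC" appears anywhere else in it, including in the accession column.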

I hope that helps.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov

DOH-PXC5303 commented 6 months ago

I ran into the same issue for sars-cov-2 (and have gotten around it by downloading the full dataset as suggested) but wanted to note I've had no issues using the geo-location flag for mpox and other taxa. Do you know if the bug is specific to sars-cov-2 @ericcox1?

ericcox1 commented 6 months ago

Hi @DOH-PXC5303,

I'm not aware of this issue affecting other taxa. This bug could be related to the large number of genome records that we have for SARS-CoV-2, currently more than 8.7 million.

-Eric

skylarwalters commented 2 months ago

Hi! I've been having this issue too when I run this command:

datasets download virus genome taxon Viruses --complete-only --host human --geo-location Senegal --filename geo.zip

Do you have any recommendations for how I may be able to get around the invalid zip archive? I'm confident it is not a space or connection issue. Thank you so much!!

ericcox1 commented 2 months ago

Hi @skylarwalters,

We haven't yet had a chance to implement better support for geographic location filtering due to other institutional priorities.

In the meantime, here's an alternative workflow that you can try:

  1. Download the list of nucleotide accessions (with versions) representing virus genome sequences isolated in Senegal from the NCBI Virus web page
  2. Use this downloaded list of nucleotide accessions with the datasets CLI to download the genome sequences, for example: datasets download virus genome accession --inputfile sequences.acc --filename senegal-viruses.zip
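
If the downloaded accession list contains a header row, Windows line endings, or entries without a version, it may help to clean it before passing it to `--inputfile`. The following is a minimal sketch, not an official recipe: `clean_accessions` is a hypothetical helper, and its regular expression is only an assumption about common nucleotide accession shapes (such as `MN908947.3` or `NC_045512.2`), so it may reject valid accessions in other formats.

```shell
# Read accessions on stdin; write unique, versioned accessions to stdout.
# The pattern is an assumption: 1-2 uppercase letters, an optional underscore,
# 5-9 digits, a dot, and a version number.
clean_accessions() {
  tr -d '\r' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep -E '^[A-Z]{1,2}_?[0-9]{5,9}\.[0-9]+$' |
    sort -u
}

# Example usage (file names are placeholders):
#   clean_accessions < sequences.acc > sequences.clean.acc
#   datasets download virus genome accession --inputfile sequences.clean.acc --filename senegal-viruses.zip
```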

Best, Eric

skylarwalters commented 2 months ago

Hi Eric! Thank you so much for the help!!