Open joverlee521 opened 9 months ago
Hi @joverlee521,
Thanks for opening this issue. We are aware of this bug but haven't yet scheduled a time to get it fixed.
Alternatively, you can download a cached SARS-CoV-2 genome data package, a highly compressed archive containing all SARS-CoV-2 sequences, use `grep` to identify sequences with the geographic location of interest, and pull out the sequences you want using `samtools`.
When I last looked at this about a year ago, the `grep` command below seemed to work well for narrowing down the list of genomes to those isolated from Washington state, but you may want to verify that it still works well for you.
Here is what I suggest:
# Download all SARS-CoV-2 genomes
datasets download virus genome taxon sars-cov-2 --filename sars2.zip
# Extract all SARS-CoV-2 genome sequences from the downloaded zip archive
unzip -qc sars2.zip ncbi_dataset/data/genomic.fna > sars2-genomic.fna
# From the downloaded zip archive, use dataformat to generate a table of genome accessions and geo-location and filter for genomes from Washington state
dataformat tsv virus-genome --package sars2.zip --fields accession,geo-location | \
grep "USA: WA\|USA: Washington\|USA:.*[, ]WA$" | \
grep -v "ID\|Idaho\|DC\|DISTRICT OF COLUMBIA" > sars2-WA-clean.tsv
# Copy the accessions to a new file
cut -f1 sars2-WA-clean.tsv > sars2-WA-clean-acc.list
# Use samtools to copy the SARS-CoV-2 genomes from Washington state, from the file containing all SARS-CoV-2 genomes to a new file
samtools faidx --region-file sars2-WA-clean-acc.list --output sars2_WA_genomes.fna sars2-genomic.fna
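To check that the filter still behaves as expected, one option is to run it on a few hand-written rows first. This is only a sketch: the accessions and the file names `sample.tsv` and `sample-WA.tsv` are made up for illustration, and the two `grep` patterns are copied from the commands above.

```shell
# Hypothetical sample rows (made-up accessions) in accession<TAB>geo-location layout
printf '%s\t%s\n' \
  'ACC0001' 'USA: WA' \
  'ACC0002' 'USA: Washington, King County' \
  'ACC0003' 'USA: Washington DC' \
  'ACC0004' 'USA: Seattle, WA' \
  'ACC0005' 'USA: Idaho' \
  'ACC0006' 'USA: California' > sample.tsv

# Same two-stage filter as in the workflow above
grep "USA: WA\|USA: Washington\|USA:.*[, ]WA$" sample.tsv | \
grep -v "ID\|Idaho\|DC\|DISTRICT OF COLUMBIA" > sample-WA.tsv

# Tally the geo-location values that survived, to eyeball for anything unexpected
cut -f2 sample-WA.tsv | sort | uniq -c | sort -rn
```

On the real data, `cut -f2 sars2-WA-clean.tsv | sort | uniq -c | sort -rn` gives the same kind of tally. Note that the `grep -v` patterns match anywhere on the line, so a short pattern like `ID` also drops any row whose accession or location happens to contain those letters.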
I hope that helps.
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov
I ran into the same issue for sars-cov-2 (and have gotten around it by downloading the full dataset as suggested) but wanted to note I've had no issues using the geo-location flag for mpox and other taxa. Do you know if the bug is specific to sars-cov-2 @ericcox1?
Hi @DOH-PXC5303,
I'm not aware of this issue affecting other taxa. This bug could be related to the large number of genome records that we have for SARS-CoV-2, which is currently at >8.7 M.
-Eric
Hi! I've been having this issue too when I run this command:
datasets download virus genome taxon Viruses --complete-only --host human --geo-location Senegal --filename geo.zip
Do you have any recommendations for how I may be able to get around the `invalid zip archive` error? I'm confident it is not a space or connection issue. Thank you so much!!
Hi @skylarwalters,
We haven't yet had a chance to implement better support for geographic location filtering due to other institutional priorities.
In the meantime, here's an alternative workflow that you can try:
datasets download virus genome accession --inputfile sequences.acc --filename senegal-viruses.zip
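The download command reads accessions from `sequences.acc`, which I understand to be a plain list with one accession per line. If you already have a table of accessions and geo-locations, a file in that shape could be built roughly like this. Treat it as a sketch: the sample rows, the accessions, the header names, and the file name `sample-meta.tsv` are all made up, and the column order (accession first, geo-location second) is an assumption.

```shell
# Hypothetical metadata table: accession<TAB>geo-location, with a header row
printf '%s\t%s\n' \
  'Accession' 'Geographic Location' \
  'ACC0001' 'Senegal' \
  'ACC0002' 'USA: WA' \
  'ACC0003' 'Senegal: Dakar' > sample-meta.tsv

# Keep accessions whose geo-location starts with "Senegal";
# the header row does not match, so it is skipped automatically
awk -F'\t' '$2 ~ /^Senegal/ { print $1 }' sample-meta.tsv > sequences.acc

cat sequences.acc
```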
Best, Eric
Hi Eric! Thank you so much for the help!!
Hi NCBI Datasets team,
Today I've tried a couple of geolocations with the `--geo-location` flag and have run into the `invalid zip archive` error every time.
My attempt with state level "WA"
```
$ ./datasets download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
2024/02/28 19:03:50 GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

2024/02/28 19:03:51 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:03:51 POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 19:03:51 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:03:51 POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 189
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"WA","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 19:03:51 HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: data/ncbi_dataset.zip 112kB done
Downloading: data/ncbi_dataset.zip 112kB invalid zip archive
Validating package []
Use datasets download virus genome taxon
```
My attempt with country level "USA"
```
$ ./datasets download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
2024/02/28 18:50:55 GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

2024/02/28 18:50:56 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 18:50:56 POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 18:50:56 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 18:50:56 POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 190
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"USA","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 18:50:56 HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: data/ncbi_dataset.zip 8.25MB done
Downloading: data/ncbi_dataset.zip 8.25MB invalid zip archive
Validating package []
Use datasets download virus genome taxon
```
My attempt with continent level "Africa"
```
$ ./datasets download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
2024/02/28 19:02:41 GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

2024/02/28 19:02:42 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:02:42 POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 19:02:42 HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/02/28 19:02:42 POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 193
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"Africa","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 19:02:42 HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: data/ncbi_dataset.zip 855B invalid zip archive
Downloading: data/ncbi_dataset.zip 855B invalid zip archive
Validating package []
Use datasets download virus genome taxon
```