ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Could not download virus genome #277

Closed (cchapus closed this issue 11 months ago)

cchapus commented 12 months ago

Hello, I've been trying to download the monkeypox genomes for the last three days. I'm using the nextstrain pipeline (https://github.com/nextstrain/monkeypox/tree/master/ingest), which runs the following command:

datasets download virus genome taxon 10244 --no-progressbar --filename data/ncbi_dataset.zip

I removed --no-progressbar --filename data/ncbi_dataset.zip for debugging purposes.

I've tried at least 50 times over three days to download the dataset, without success. On my work ISP, the download always fails at the same point:

Downloading: ncbi_dataset.zip    43.4MB 418kB/s
Error: Internal error (invalid zip archive). Please try again

Use datasets download virus genome taxon <command> --help for detailed help about a command.
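
The "invalid zip archive" message suggests the file on disk is incomplete. As a minimal sketch for checking a downloaded archive independently of the CLI, the following uses Python's standard zipfile module. The function name check_zip is hypothetical and is not part of the datasets tool, and the sketch assumes the partial file is still on disk after the failure.

```python
import zipfile
from pathlib import Path


def check_zip(path):
    """Return (ok, reason) describing whether `path` is a readable zip archive."""
    p = Path(path)
    if not p.is_file():
        return False, "file not found"
    try:
        with zipfile.ZipFile(p) as zf:
            # testzip() reads every member and returns the name of the
            # first one whose CRC does not match, or None if all are fine.
            bad_member = zf.testzip()
    except zipfile.BadZipFile as exc:
        # A truncated download typically lacks the end-of-central-directory
        # record, so the archive cannot even be opened.
        return False, f"not a valid zip: {exc}"
    if bad_member is not None:
        return False, f"corrupt member: {bad_member}"
    return True, f"ok ({p.stat().st_size} bytes)"
```

Calling check_zip("ncbi_dataset.zip") after a failed run would show whether the archive could be opened at all or whether a specific member is damaged.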

So I tried the --debug option, as suggested in closed issues (this problem seems to come up frequently).

GET /datasets/v2alpha/taxonomy/taxon/10244 HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/15.24.0/go
Accept: application/json
Ncbi-Phid: D909E87BA8639A64A00CB84C
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon 10244 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 15.24.0
Accept-Encoding: gzip

2023/10/25 09:48:17 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 25 Oct 2023 07:48:17 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: D909E87BA8639A64A00CB84C.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 15.24.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2023/10/25 09:48:17 
POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/15.24.0/go
Content-Length: 185
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: D909E87BA8639A64A00CB84C
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon 10244 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 15.24.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"10244"}

2023/10/25 09:48:17 
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 25 Oct 2023 07:48:17 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 10244
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: D909E87BA8639A64A00CB84C.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 15.24.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: ncbi_dataset.zip    43.4MB 418kB/s
Error: Internal error (invalid zip archive). Please try again

Use datasets download virus genome taxon <command> --help for detailed help about a command.

Another strange behaviour: when I use a different ISP (my cell phone's), the internal error occurs at a different point in the download each time:

2023/10/25 10:01:57 
GET /datasets/v2alpha/taxonomy/taxon/10244 HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/15.24.0/go
Accept: application/json
Ncbi-Phid: E32354406B0B070C2551A7A4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon 10244 --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 15.24.0
Accept-Encoding: gzip

2023/10/25 10:01:58 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 25 Oct 2023 08:01:58 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E32354406B0B070C2551A7A4.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 15.24.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2023/10/25 10:01:58 
POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/15.24.0/go
Content-Length: 185
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: E32354406B0B070C2551A7A4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon 10244 --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 15.24.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"10244"}

2023/10/25 10:01:58 
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 25 Oct 2023 08:01:58 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 10244
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E32354406B0B070C2551A7A4.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 15.24.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Downloading: data/ncbi_dataset.zip    244MB 2MB/s
Error: Internal error (invalid zip archive). Please try again

Use datasets download virus genome taxon <command> --help for detailed help about a command.

So this time it failed at 244 MB. Sometimes it's 40, 110 or 170 MB, and rarely the same number twice, unlike on my work ISP.
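
Because the failure point varies between attempts, one workaround while the cause is investigated could be to retry the command until the archive opens cleanly. This is only a sketch under assumptions: download_with_retry and its parameters are hypothetical names, the command list is the one quoted earlier in this post, and the run argument exists only so the loop can be exercised without the datasets binary.

```python
import subprocess
import time
import zipfile

# Command taken from the nextstrain ingest step quoted in this issue.
NEXTSTRAIN_CMD = [
    "datasets", "download", "virus", "genome", "taxon", "10244",
    "--no-progressbar", "--filename", "data/ncbi_dataset.zip",
]


def download_with_retry(cmd, out_path, attempts=3, wait_seconds=5.0,
                        run=subprocess.run):
    """Run `cmd` until `out_path` is a readable zip or attempts run out.

    Returns the 1-based number of the attempt that succeeded, or None.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        # is_zipfile() returns False for a missing or truncated file.
        if result.returncode == 0 and zipfile.is_zipfile(out_path):
            return attempt
        if attempt < attempts:
            time.sleep(wait_seconds * attempt)  # wait a little longer each time
    return None


# Example call (not executed here):
# download_with_retry(NEXTSTRAIN_CMD, "data/ncbi_dataset.zip", attempts=5)
```

Retrying does not address the underlying truncation; it only saves repeating the command by hand.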

The plain datasets download virus genome taxon 10244 command fails as shown above. Adding --dehydrated succeeds, but the output format is not the one the nextstrain pipeline expects.

Are you currently having network-related issues, as in #270, #253 and #89? I've tried from 2 AM to 12 PM EST.

ericcox1 commented 12 months ago

Hi @cchapus,

Thanks for opening this issue. I was unable to reproduce the error from here, but since you provided the debug output we can investigate what went wrong in this particular case. I'll comment on this thread when we have some more information.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov

olearyna commented 11 months ago

Hi @cchapus,

We appreciate you bringing this issue to our attention. The problem you highlighted has been addressed in our latest release, version 15.27.1. Should you encounter any further issues, please don’t hesitate to let us know.
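
For anyone scripting around this, a quick way to tell whether an installed client predates the fix is to compare its version string against 15.27.1. The sketch below assumes the version is a plain dotted string such as the 15.24.0 shown in the X-Datasets-Client-Version header above; parse_version and has_fix are hypothetical helper names.

```python
def parse_version(text):
    """Turn a dotted version string such as '15.24.0' into (15, 24, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


# Release named in this thread as containing the fix.
FIXED_IN = parse_version("15.27.1")


def has_fix(client_version):
    """Return True if `client_version` is at or above the fixed release."""
    return parse_version(client_version) >= FIXED_IN
```

With the version from the debug output, has_fix("15.24.0") is False, which matches the failing runs reported above.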

Best regards, Nuala

Nuala A. O'Leary Product Owner, NCBI Datasets National Center for Biotechnology Information, NLM, NIH, DHHS olearyna@nih.gov