ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Getting error "Error: Internal error (invalid zip archive). Please try again." Take 2 #360

Closed corneliusroemer closed 5 months ago

corneliusroemer commented 5 months ago

Sadly the issue is still active, at least for the taxa ebola-zaire and mpox.

See #356

New version of client (16.16.0) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-arm64/datasets.
Error: Internal error (invalid zip archive). Please try again

Originally posted by @corneliusroemer in https://github.com/ncbi/datasets/issues/356#issuecomment-2111024211

ericcox1 commented 5 months ago

Thanks @corneliusroemer, we are continuing to look into this. Would you mind updating to 16.16.0? If the problem persists, please rerun with --debug and report the phid. This will help us better understand what went wrong.

Best, Eric

joverlee521 commented 5 months ago

I am also seeing this error in our automated pipelines for zika, mpox, measles, and dengue, which are all scheduled to run at 9AM PDT. If I rerun the workflow at a later time, the error goes away. Does the time coincide with the datasets updates?

corneliusroemer commented 5 months ago

@ericcox1 Yes, I'm getting the error with 16.16.0 as well. An example run is: Ncbi-Phid: 1D715361FD2DDA414583C0181D715361FD2DDA414583C018 (this exact run may have happened to work; I can't tell, because running with --debug flooded my terminal with binary text). I'll try to provoke an error again.

Is it possible that some part of the server struggles with the number of requests it's getting? As part of a project, I'm doing dataset downloads via CLI for a few taxa around every 3 minutes (it's run as part of CI). It's done with API key and the allowed rate is 10 requests per second so we should be far away from that limit but it might still be that no one else hitherto has sent requests so frequently.
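
For pipelines that hit this error intermittently, one possible workaround is to retry the download with a growing delay between attempts. The following is a minimal Python sketch and not part of the datasets CLI: the function name `download_with_retry`, the attempt count, and the delays are illustrative assumptions, and it assumes the CLI exits with a non-zero status when it prints the error.

```python
import subprocess
import time
from typing import Callable, Sequence


def download_with_retry(
    cmd: Sequence[str],
    attempts: int = 4,
    base_delay: float = 30.0,
    run: Callable[[Sequence[str]], int] = lambda c: subprocess.run(c).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0; return the number of the successful attempt.

    Raises RuntimeError if every attempt fails. Assumes (not confirmed in
    this thread) that a failed download gives a non-zero exit status.
    """
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{cmd[0]} failed after {attempts} attempts")


# Example call (not executed here), using a command quoted in this thread:
# download_with_retry(["datasets", "download", "virus", "genome", "taxon",
#                      "11320", "--include", "genome,biosample"])
```

Retrying only hides the symptom, so it is still worth capturing a --debug run for the maintainers when a failure occurs.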

AngieHinrichs commented 5 months ago

I've been getting the same error (Error: Internal error (invalid zip archive). Please try again) repeatedly for the past several days while trying to get influenza A genomes with this command:

datasets download virus genome taxon 11320 --include genome,biosample --debug >& datasets.log

Here is the gzipped --debug output: datasets.log.gz

The download proceeds for a varying amount of time (roughly 2 to 39 minutes) and downloads a varying amount of data (I haven't kept track, but noticed different numbers of GB) before exiting with the error.

I'm using datasets version: 16.17.0

AngieHinrichs commented 5 months ago

Earlier today, this command succeeded for me:

datasets download virus genome taxon "Alphainfluenzavirus influenzae" --filename all_alphainfluenza.zip

-- it's the first example command on https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/virus/get-influenza-genomes/ . In 87 minutes, it downloaded a 555MB (530MiB) file that includes data_report.jsonl and genome.fna, but not biosample.jsonl.

Unfortunately the command above with --include genome,biosample has failed twice this afternoon, both times making it to 67.3MB before getting the invalid zip archive error.
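
When a download does complete, the resulting archive can be checked independently of the CLI. The sketch below uses Python's standard `zipfile` module to test every member's checksum and to report which expected files are absent. The function name `check_archive` is hypothetical, matching files by base name is an assumption about the package layout, and this check is not necessarily the same validation the datasets client performs.

```python
import zipfile
from typing import Iterable, List, Optional, Tuple


def check_archive(path: str, required: Iterable[str] = ()) -> Tuple[Optional[str], List[str]]:
    """Return (first_corrupt_member_or_None, required_files_missing).

    Raises zipfile.BadZipFile if the file cannot be opened as a zip at all,
    for example when a truncated download lacks its central directory.
    """
    with zipfile.ZipFile(path) as zf:
        corrupt = zf.testzip()  # reads each member and verifies its CRC
        # Compare by base name so the directory layout inside the zip
        # does not matter (assumption: file names are unique in the package).
        basenames = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    missing = [name for name in required if name not in basenames]
    return corrupt, missing
```

For the archive described above, `check_archive("all_alphainfluenza.zip", ["data_report.jsonl", "genome.fna", "biosample.jsonl"])` would be expected to list `biosample.jsonl` as missing if the file is otherwise intact.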

olearyna commented 5 months ago

Hi AngieHinrichs,

Thanks for opening the issue. We're looking into it.

Nuala

olearyna commented 5 months ago

@AngieHinrichs,

Can you run this again with the --debug flag and send us the PHID? - thanks!

AngieHinrichs commented 5 months ago

OK, I am kicking off this command (there's no --no-progress-bar option, so adding a grep -v) and will send PHID and log. Thanks!

time datasets download virus genome taxon 11320 --include genome,biosample --debug |& grep -v ^$'\033' > datasets.log

AngieHinrichs commented 5 months ago

OK, PHID is 2F4065564DC261B8F1FA965F. Log attached. datasets.2024-05-24.log.gz
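
Finding the PHID in a long --debug log can be tedious. The sketch below drops progress-bar lines (those starting with an escape character, the same filter as the grep -v above) and collects the distinct identifiers. It assumes the log contains the text `Ncbi-Phid:` followed by a hexadecimal value with optional dotted numeric suffixes, which is inferred from the values quoted in this thread; the function name `extract_phids` is hypothetical.

```python
import re
from typing import Iterable, List

ESC = "\x1b"
# Assumed format: "Ncbi-Phid:" then hex digits, optionally followed by ".N" parts.
PHID_RE = re.compile(r"Ncbi-Phid:\s*([0-9A-Fa-f]+(?:\.[0-9]+)*)", re.IGNORECASE)


def extract_phids(lines: Iterable[str]) -> List[str]:
    """Return distinct PHID values in order of first appearance."""
    seen: List[str] = []
    for line in lines:
        if line.startswith(ESC):  # skip progress-bar redraw lines
            continue
        for value in PHID_RE.findall(line):
            if value not in seen:
                seen.append(value)
    return seen
```

A typical use would be `extract_phids(open("datasets.log", errors="replace"))`, reporting the last value in the returned list if the maintainers ask for the most recent request.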

olearyna commented 5 months ago

Hi AngieHinrichs,

We need to take a deeper look at the issue. We'll post here when we have a fix.

Nuala

AngieHinrichs commented 5 months ago

Thanks @olearyna!

carolinasisco commented 5 months ago

Hi,

Any good news on this? I have had the same error since Monday. I thought something was wrong with my code until I read this post.

olearyna commented 5 months ago

Hi carolinasisco,

We are actively working on a fix and aim to have it released within the week. We apologize for any inconvenience this may have caused. Thanks for the patience!

Nuala

olearyna commented 5 months ago

Hi carolinasisco and AngieHinrichs,

We have released a fix in the latest version (v16.18.1) of the command line tool that we believe addresses the reported issues. Please test this update and let us know if you encounter any further errors.

Thanks, Nuala

AngieHinrichs commented 5 months ago

Thanks @olearyna, I'll try it out right away!

AngieHinrichs commented 5 months ago

It worked and it was much faster than before! Thanks again!

olearyna commented 5 months ago

Great! I'll close this issue.

carolinasisco commented 5 months ago

Hi, it did not work for me. Any suggestions? I got the same error.

corneliusroemer commented 5 months ago

Thanks so much @olearyna and @ericcox1! I just upgraded to 16.18.1 and the first run looks promising: none of the 4 taxon downloads failed. 🎉

I will comment as soon as I see failures again.

@carolinasisco are you sure you're using version 16.18.1?

I think it would help the devs if you could run with --debug then and share the PHID 😀
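
Since the fix shipped in v16.18.1, an automated pipeline may want to confirm the installed client is at least that version before downloading. The sketch below parses a version string such as "datasets version: 16.17.0" (the form quoted earlier in this thread). The helper names `parse_version` and `has_fix` are hypothetical, and the exact output of the client's version command is an assumption.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first MAJOR.MINOR.PATCH triple from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(text: str, minimum: Tuple[int, ...] = (16, 18, 1)) -> bool:
    """True if the reported version is at least the release that carried the fix."""
    # Tuples compare element by element, so (16, 31, 0) > (16, 18, 1).
    return parse_version(text) >= minimum
```

Comparing integer tuples avoids the pitfall of comparing version strings, where "16.9.0" would sort after "16.18.1".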

olearyna commented 5 months ago

Hi @carolinasisco,

Yes, if you are still having issues with the latest version, can you run with --debug and share the PHID? Thanks for the suggestion, corneliusroemer!

carolinasisco commented 5 months ago

Hi @olearyna

I updated through conda --update, and the version showing is 16.18.1. This is my code (I ran it with --debug as suggested):

datasets download gene accession --inputfile ~/Desktop/wp_1_50 --filename wp150 --include gene,protein --debug

The error is:

Error: Download error: http2: server sent GOAWAY and closed the connection; LastDownloading: ncbi_dataset.zip 4.62MB error

Find attached the screen capture with the phid.

phid

Thanks!

olearyna commented 5 months ago

Hi carolinasisco,

Thanks for the information! I think this is a separate issue from the virus genome download. We'll look into it tomorrow.

Nuala

carolinasisco commented 5 months ago

Hi, thank you. I'm trying to download a large set of sequences (nt and aa) from Pseudomonas.

mverce commented 3 weeks ago

Hi, I would like to add another example of this error, in hopes of it being helpful in finding a solution. I am using NCBI Datasets version 16.31.0. I was trying to download Streptococcus genomic sequences using the following command:

datasets download genome taxon Streptococcus --include genome,gbff --reference

This results in the following outcome:

Collecting 125 genome records [================================================] 100% 125/125
Downloading: ncbi_dataset.zip 273MB done
Validating package files [==>---------------------------------------------] 9% 23/254
Error: Internal error (invalid zip archive). Please try again

Across several attempts, validation of the package files reached only 6-9% before failing.

I reran the command while including either genomes or gbff. When downloading genomes only (--include genome), the process finished successfully. When downloading gbff only (--include gbff) the process failed with the same Internal Error as mentioned above.
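
The manual bisection described above (one run per --include value) can be scripted. The sketch below runs the base command once per component and records each exit status. The function name `isolate_failing_includes` and the `only_<component>.zip` output names are hypothetical, and it assumes a failed download gives a non-zero exit status.

```python
import subprocess
from typing import Callable, Dict, List, Sequence


def isolate_failing_includes(
    base_cmd: Sequence[str],
    includes: Sequence[str],
    run: Callable[[List[str]], int] = lambda c: subprocess.run(c).returncode,
) -> Dict[str, int]:
    """Run base_cmd once per --include value and map each value to its exit status."""
    results: Dict[str, int] = {}
    for item in includes:
        # A separate output file per component keeps the runs from overwriting each other.
        cmd = [*base_cmd, "--include", item, "--filename", f"only_{item}.zip"]
        results[item] = run(cmd)
    return results


# Example call (not executed here), mirroring the command in this comment:
# isolate_failing_includes(
#     ["datasets", "download", "genome", "taxon", "Streptococcus", "--reference"],
#     ["genome", "gbff"],
# )
```

Any component that maps to a non-zero status is a candidate to report, together with the PHID from a --debug run of that single command.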

ericcox1 commented 3 weeks ago

Hi @mverce,

Thanks for your report.

I wasn't able to reproduce this error and we think you may have encountered a temporary problem.

If you don't mind trying this one more time, please add the --debug flag and report the Ncbi-phid value here so we can investigate further.

datasets download genome taxon Streptococcus --include gbff --reference --filename strep.zip --debug

Best, Eric

mverce commented 2 weeks ago

Hi @ericcox1,

I have tried it again with the commands that were problematic yesterday, as well as with your exact command (incl. --filename strep.zip), but the problem persists. The last Ncbi-Phid from the debug output is: 1CA6C01E4134F3592F685054.6.1

Thanks and best regards, Marko

corneliusroemer commented 2 weeks ago

I tried the same command that Eric listed and can't reproduce the error.