ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

404 during rehydration #198

Closed bernt-matthias closed 1 year ago

bernt-matthias commented 1 year ago

I get the following error during rehydration:

Error: 
File unavailable ([gateway] 404 Not Found): ncbi_dataset/data/GCF_003031525.2/GCF_003031525.2_Neophocaena_asiaeorientalis_V1_genomic.fna.gz
File unavailable ([gateway] 404 Not Found): ncbi_dataset/data/GCF_003031525.2/genomic.gff.gz

Command lines used:

datasets download genome taxon 'Chordata' --tax-exact-match --reference --annotated --assembly-version latest --assembly-source refseq --include genome,gff3 --no-progressbar --dehydrated
dataformat tsv genome --package ncbi_dataset.zip --fields accession,assminfo-name,assminfo-submitter,assmstats-total-ungapped-len,organism-name > genome_data_report.tsv
7z x -y ncbi_dataset.zip > 7z.log
datasets rehydrate --directory ./ --gzip --max-workers 10

ericcox1 commented 1 year ago

Hi @bernt-matthias,

Thanks for opening this issue.

We think this was caused by a filename change that happened between the time you downloaded the dehydrated package and the time you rehydrated it. One question: did the rehydration stop when the error was reported, or did it continue despite the error?

At the moment, we would like to report these errors to the user but still be able to finish the rehydration for all other files.

In the meantime, you should now be able to download a new dehydrated package, then rehydrate without seeing this error.
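
A minimal sketch of that re-download-then-rehydrate cycle, reusing only the commands and flags from the original report. The function names `run` and `refresh_and_rehydrate` and the `DRY_RUN` switch are assumptions made for illustration; they are not part of the datasets CLI.

```shell
#!/usr/bin/env bash
# Sketch: fetch a fresh dehydrated package and rehydrate it right away,
# so the file names recorded in the package are less likely to be stale.
# DRY_RUN=1 (hypothetical switch) prints each command instead of running it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: refresh_and_rehydrate <taxon>
# Works in the current directory, like the commands in the report.
refresh_and_rehydrate() {
    local taxon="$1"
    run datasets download genome taxon "$taxon" --tax-exact-match --reference --annotated \
        --assembly-version latest --assembly-source refseq \
        --include genome,gff3 --no-progressbar --dehydrated || return 1
    run 7z x -y ncbi_dataset.zip || return 1
    run datasets rehydrate --directory ./ --gzip --max-workers 10
}
```

Calling `DRY_RUN=1 refresh_and_rehydrate 'Chordata'` prints the three commands without running them, which is a cheap way to check the flags before starting a long download.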

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov

bernt-matthias commented 1 year ago

I'm not sure. The last message on stdout is:

Completed 944 of 946 [===============================================>] 100%

so two files are missing. I assumed they just happened to be the last ones.

Anyway, the tool stops with exit code 1.
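
For scripting around this, the missing paths can be pulled out of the captured output. This is a minimal sketch, assuming the message format stays as in the error shown above (`File unavailable (<reason>): <path>`); the function name `unavailable_files` and the log file name in the usage note are made up for illustration.

```shell
# Print one path per "File unavailable" line read from stdin.
# Assumes the message format "File unavailable (<reason>): <path>".
# Lines that do not match (progress output, "Error:") are ignored.
unavailable_files() {
    sed -n 's/^File unavailable ([^)]*): //p'
}
```

For example, after `datasets rehydrate --directory ./ --gzip --max-workers 10 > rehydrate.log 2>&1`, running `unavailable_files < rehydrate.log` lists the files that were not fetched. Both streams are captured here because it is an assumption which one the messages are written to.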

At the moment, we would like to report these errors to the user but still be able to finish the rehydration for all other files.

Seems like a good idea. The error message could be improved, e.g. by mentioning that the download will continue.

Note that in an automated environment like Galaxy the data would be unusable. I was wondering about a command-line argument that allows changing the behavior:

The first is probably useful for interactive usage, and the latter two for use in environments like Galaxy.
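
Until something like this exists in the tool, a wrapper script could approximate the idea. This is a minimal sketch: the `ON_MISSING` variable, its values `fail` and `warn`, and the function name `check_rehydration_log` are hypothetical and are not datasets options. It assumes the rehydration output was captured in a log file and that the "File unavailable" wording stays as reported above.

```shell
# Decide a wrapper's exit status from a captured rehydration log.
# ON_MISSING (hypothetical, not a datasets option):
#   fail - any unavailable file is an error (default; for pipelines like Galaxy)
#   warn - report unavailable files but return success (for interactive use)
check_rehydration_log() {
    local log="$1" missing
    if [ ! -r "$log" ]; then
        echo "cannot read log file: $log" >&2
        return 2
    fi
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    missing=$(grep -c '^File unavailable' "$log" || true)
    if [ "$missing" -eq 0 ]; then
        return 0
    fi
    echo "$missing file(s) could not be rehydrated" >&2
    [ "${ON_MISSING:-fail}" = "warn" ]
}
```

With the default, a pipeline step fails as soon as a single file is missing, so incomplete data is never passed on silently; `ON_MISSING=warn` keeps the partial result and only prints the count.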