ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Download error for list of WP accessions #372

Open ericcox1 opened 4 months ago

ericcox1 commented 4 months ago

Hi @olearyna

I updated through conda --update; the version showing is 16.18.1. This is my code (I ran it with --debug as suggested):

datasets download gene accession --inputfile ~/Desktop/wp_1_50 --filename wp150 --include gene,protein --debug

The error is:

Error: Download error: http2: server sent GOAWAY and closed the connection; LastDownloading: ncbi_dataset.zip 4.62MB error

Find attached the screen capture with the phid.

[Image: phid]

Thanks!

Originally posted by @carolinasisco in https://github.com/ncbi/datasets/issues/360#issuecomment-2143635518

ericcox1 commented 4 months ago

Hi @carolinasisco,

I'm opening a new issue since this is a separate problem from #360. We are continuing to investigate.

Best, Eric


ericcox1 commented 4 months ago

Hi @carolinasisco,

Download failures for WP accessions annotated on large numbers of genomes are a known problem, and we are continuing to research ways to make this work.

For example, I tested individual downloads of each WP accession in the list you provided and I was able to successfully download most of them, while I saw reproducible failures with WP_003084404.1, which is annotated on ~28 k genome assemblies.
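The per-accession test described above could be scripted roughly as follows. This is only a sketch: `find_failing` and the `DATASETS_CMD` override are hypothetical names introduced here so the loop can be dry-run with a stub, and the `datasets` flags are the ones already used in this thread.

```shell
#!/bin/sh
# Sketch: try each WP accession on its own and print the ones whose download fails.
# DATASETS_CMD is a hypothetical hook (not part of the datasets CLI) that lets you
# substitute a stub command when testing the loop without network access.
DATASETS_CMD=${DATASETS_CMD:-datasets}

find_failing() {
    list_file=$1                      # text file with one accession per line
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue     # skip blank lines
        # </dev/null keeps the command from consuming the loop's input
        if ! "$DATASETS_CMD" download gene accession "$acc" \
                --include gene,protein --filename "$acc.zip" \
                >/dev/null 2>&1 </dev/null; then
            printf '%s\n' "$acc"      # report the accession that failed
        fi
    done < "$list_file"
}
```

For example, `find_failing ~/Desktop/wp_1_50 > failed_accessions.txt` would collect the failing accessions in one file. A failure here only means the command exited with a non-zero status; it does not say why.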

Let me ask you, for proteins such as WP_003084404.1, are you interested in downloading the genomic sequence from all ~28 k genomes on which this protein is annotated? If the answer is yes, then I can tell you that we will continue looking at ways to make this work. If not, we may be able to point you to easier ways to get a smaller set of genome sequences.

I look forward to hearing from you soon.

Best, Eric

carolinasisco commented 4 months ago

Hi @ericcox1

Thank you so much for your effort. I'm only interested in obtaining the nucleotide and amino acid sequences of these proteins from Pseudomonas aeruginosa PA14, a hypervirulent strain (for more context, I will work with genome metabolic models).

Carol

ericcox1 commented 4 months ago

That's helpful information. Of the 50 WP accessions in your list, I found 4 that are annotated on PA14 genomes: WP_003088572.1, WP_003101261.1, WP_003109333.1, WP_003138346.1

Normally, I would suggest using this command to request genomic sequence specifically from the taxon of interest:

datasets download gene accession WP_003088572.1 --include gene,protein --taxon-filter 'Pseudomonas aeruginosa PA14' --filename PA14.zip

However, I found a bug where we are incorrectly reporting an error. While we investigate this bug, here is an alternative approach that uses curl against our API to get the data:

curl -o PA14-proteins.zip "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/protein/accession/WP_003088572.1%2CWP_003101261.1%2CWP_003109333.1%2CWP_003138346.1/download?include_annotation_type=FASTA_GENE&include_annotation_type=FASTA_PROTEIN&taxon=652611"
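For a different set of accessions, a URL of the same shape can be assembled in the shell. The sketch below only reuses the endpoint path, query parameters, and taxon value from the curl command above, with the accessions joined by URL-encoded commas (`%2C`) and the parameters separated by a single `&`. It prints the URL rather than downloading anything, and whether the endpoint accepts other accession counts or taxon values is an assumption to check against the API documentation.

```shell
#!/bin/sh
# Sketch: rebuild the download URL from a list of WP accessions.
set -- WP_003088572.1 WP_003101261.1 WP_003109333.1 WP_003138346.1
taxid=652611                      # taxon value taken from the curl command above

# Join the accessions with URL-encoded commas (%2C).
joined=$(printf '%s%%2C' "$@")
joined=${joined%\%2C}             # drop the trailing %2C

base="https://api.ncbi.nlm.nih.gov/datasets/v2alpha/protein/accession"
url="${base}/${joined}/download?include_annotation_type=FASTA_GENE&include_annotation_type=FASTA_PROTEIN&taxon=${taxid}"

echo "$url"
# To download: curl -o PA14-proteins.zip "$url"
```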

Here's a peek at what the FASTA headers will look like for the genomic sequence included in the package:

unzip -cq PA14-proteins.zip ncbi_dataset/data/gene.fna | grep '>' | head -5
>NZ_CP104980.1:c3445703-3444279 TIGR00366 family protein [protein_accession=WP_003088572.1] [organism=Pseudomonas aeruginosa PA14] [name=TIGR00366 family protein]
>NZ_CP104980.1:c912510-911707 hpaH [protein_accession=WP_003101261.1] [organism=Pseudomonas aeruginosa PA14] [name=2-oxo-hepta-3-ene-1,7-dioic acid hydratase] [gene=hpaH]
>NZ_CP104980.1:1482995-1483945 accA [protein_accession=WP_003109333.1] [organism=Pseudomonas aeruginosa PA14] [name=acetyl-CoA carboxylase carboxyl transferase subunit alpha] [gene=accA]
>NZ_CP104981.1:c3445703-3444279 TIGR00366 family protein [protein_accession=WP_003088572.1] [organism=Pseudomonas aeruginosa PA14] [name=TIGR00366 family protein]
>NZ_CP104981.1:c912510-911707 hpaH [protein_accession=WP_003101261.1] [organism=Pseudomonas aeruginosa PA14] [name=2-oxo-hepta-3-ene-1,7-dioic acid hydratase] [gene=hpaH]

-Eric

carolinasisco commented 3 months ago

Hi @ericcox1

Thank you for your help. I tried with curl and it worked! Please let us know when the datasets download for WP accessions is working again.

Carol

gabepen commented 3 months ago

Checking in on the progress for this. I'm also having issues downloading WP accessions; specifically, many within the Gammaproteobacteria subtree are failing with the zip archive error. I am getting some success on certain sets of accessions (I'm downloading hundreds of different sets), but it's rare. I've attached the debug output of one of the sets I am trying to download within a workflow. I'm not sure if it's a different issue, but when using the datasets Python API I get this error:

gene_ids_for_accessions = [int(gene_rec.gene.gene_id) for gene_rec in gene_reply.genes]
                               ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'gene_id'

This is using the function described here: https://www.ncbi.nlm.nih.gov/datasets/docs/v1/how-tos/genes/download-gene-data-package/

CLI output:

datasets download gene accession WP_042478116.1 WP_035893615.1 WP_010672261.1 WP_054661990.1 WP_024306120.1 WP_006660648.1 WP_003097132.1 WP_043192052.1 WP_000896506.1 WP_111259221.1 WP_067643804.1 WP_075184776.1 WP_009684663.1 WP_003367855.1 --include gene --filename outpath.zip --debug
2024/07/01 12:47:06
POST /datasets/v2alpha/protein/accession/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 296
Accept: text/plain,application/zip
Content-Type: application/json
Ncbi-Phid: 3E7681ACC0FC1E4745A072E1
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_042478116.1 WP_035893615.1 WP_010672261.1 WP_054661990.1 WP_024306120.1 WP_006660648.1 WP_003097132.1 WP_043192052.1 WP_000896506.1 WP_111259221.1 WP_067643804.1 WP_075184776.1 WP_009684663.1 WP_003367855.1 --include gene --filename outpath.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"accessions":["WP_042478116.1","WP_035893615.1","WP_010672261.1","WP_054661990.1","WP_024306120.1","WP_006660648.1","WP_003097132.1","WP_043192052.1","WP_000896506.1","WP_111259221.1","WP_067643804.1","WP_075184776.1","WP_009684663.1","WP_003367855.1"],"include_annotation_type":["FASTA_GENE"]}

2024/07/01 12:47:11
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Mon, 01 Jul 2024 19:47:11 GMT
Grpc-Metadata-Logging-Accessions: WP_000896506.1,WP_003097132.1,WP_003367855.1,WP_006660648.1,WP_009684663.1,WP_010672261.1,WP_024306120.1,WP_035893615.1,WP_042478116.1,WP_043192052.1,WP_054661990.1,WP_067643804.1,WP_075184776.1,WP_111259221.1
Grpc-Metadata-Logging-Accessions_count: 14
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Include_annotation_type: FASTA_GENE
Grpc-Metadata-Logging-Service: prokaryote
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 3E7681ACC0FC1E4745A072E1.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Downloading: outpath.zip 934kB done
Validating package []
Error: Internal error (invalid zip archive). Please try again

ericcox1 commented 3 months ago

Hi @gabepen,

Thanks for your comment.

We haven't had a chance to look into this yet due to other institutional priorities.

Just to confirm, are you interested in downloading all underlying genomic sequences for each protein in your query? For example, for WP_003097132.1, this protein is annotated on close to 10 k genomes. Do you need the genomic sequences from each of the 10 k genomes?

Best, Eric

gabepen commented 3 months ago

@ericcox1

It depends on each query. It doesn't seem to be a gene dataset size issue, though: I've tested the --taxon-filter option and still get the zip archive error for a single sequence.

I am also noticing that certain tax IDs return this error: "no download gene by accession data is currently available for this taxon". Is there a reason for this?

ericcox1 commented 3 months ago

Thanks @gabepen.

> I've tested the --taxon-filter option and still get the zip archive error for a single sequence.

Good point. We are going to release a fix for this bug sometime next week.

Fixing the download errors for WP accessions annotated on many thousands of genomes is going to take some more research and we don't have a definite timeline for this yet.

> I am also noticing that certain tax IDs return this error: "no download gene by accession data is currently available for this taxon". Is there a reason for this?

Could you please share an example for this issue?

Best, Eric

gabepen commented 3 months ago

@ericcox1

I thought I was getting the error only for taxa without a genome labeled as a reference, but I tested it with this taxon and received the same error:


datasets download gene accession 'WP_000818647.1' --include gene --taxon-filter 2774015 --debug
2024/07/05 14:38:23 
POST /datasets/v2alpha/taxonomy/taxon_suggest HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 130
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"exact_match":true,"tax_rank_filter":"higher_taxon","taxon_query":"2774015","taxon_resource_filter":"TAXON_RESOURCE_FILTER_ALL"}

2024/07/05 14:38:27 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:27 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/07/05 14:38:27 
POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"returned_content":"COMPLETE","taxons":["2774015"]}

2024/07/05 14:38:28 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:27 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

2024/07/05 14:38:28 
POST /datasets/v2alpha/taxonomy/taxon_suggest HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 153
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"exact_match":true,"tax_rank_filter":"higher_taxon","taxon_query":"Pectobacterium quasiaquaticum","taxon_resource_filter":"TAXON_RESOURCE_FILTER_GENE"}

2024/07/05 14:38:28 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:28 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

Error: The taxonomy ID '2774015' is valid for 'Pectobacterium quasiaquaticum', but no download gene by accession data is currently available for this taxon.

And then I noticed I get a different error when the WP accessions are passed as a list:


datasets download gene accession ['WP_000818647.1'] --include gene --taxon-filter 2774015 --debug
2024/07/05 14:40:35 
POST /datasets/v2alpha/gene HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 146
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 115B94F10BCFA5CFB1656134
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession [WP_000818647.1] --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip

{"accessions":["[WP_000818647.1]"],"include_tabular_header":"INCLUDE_TABULAR_HEADER_FIRST_PAGE_ONLY","page_size":1,"returned_content":"IDS_ONLY"}

2024/07/05 14:40:39 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:40:39 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 115B94F10BCFA5CFB1656134.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block

New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Error: No genes found that match selection

If I pass a large enough list of accessions I will find some gene records for the taxid I tested above, but I'm confident that the single one tested is annotated on the reference genome.
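The request body in the second log shows the accession arriving as the single string "[WP_000818647.1]", brackets included. That is how the shell treats Python-style list syntax: it strips the quotes and passes the rest through as one literal argument. The sketch below reproduces this without calling the CLI; `show_args` is a hypothetical helper that just prints its arguments, and the second accession is borrowed from an earlier comment purely for illustration.

```shell
#!/bin/sh
# Sketch: show what a program actually receives for each spelling of the accessions.
show_args() { printf 'arg: %s\n' "$@"; }

# Work in an empty directory so the unquoted brackets cannot glob-match a file name.
cd "$(mktemp -d)" || exit 1

# Python-style list: quotes are removed and one bracketed argument is passed.
as_list=$(show_args ['WP_000818647.1'])

# Space-separated accessions: each one arrives as its own argument.
as_separate=$(show_args WP_000818647.1 WP_003097132.1)

echo "$as_list"
echo "$as_separate"
```

On the command line, accessions can be given as separate space-separated arguments or read from a file with --inputfile, as in the commands earlier in this thread.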

ericcox1 commented 2 months ago

Hi @carolinasisco and @gabepen,

Thanks for your patience.

Here's an update:

  1. We are continuing to investigate how to better support requests for large numbers of genome sequences (10 k+) for a given WP.
  2. We have fixed the bug where we were incorrectly returning an error for certain WPs.

For example, after updating to 16.23.0, this now works:

datasets download gene accession WP_003088572.1 --include gene,protein --taxon-filter 'Pseudomonas aeruginosa PA14' --filename PA14.zip
Downloading: PA14.zip    4.69kB valid zip structure -- files not checked
Validating package [================================================] 100% 6/6
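Since the fix depends on the client version, it may help to check the installed version before retrying. The sketch below compares two version strings with `sort -V`, a GNU coreutils option that is assumed to be available. `version_ge` is a hypothetical helper, and the client value is the X-Datasets-Client-Version reported in the debug logs earlier in this thread; substitute your own.

```shell
#!/bin/sh
# Sketch: succeed when version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

client=16.20.0    # X-Datasets-Client-Version from the debug logs in this thread
fixed=16.23.0     # release mentioned above as containing the fix

if version_ge "$client" "$fixed"; then
    echo "client $client should include the fix"
else
    echo "client $client predates $fixed; update before retrying"
fi
```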

Best, Eric