Open ericcox1 opened 4 months ago
Hi @carolinasisco,
I'm opening a new issue since this is a separate problem from #360. We are continuing to investigate.
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets NIH/NLM/NCBI eric.cox@nih.gov
Hi @carolinasisco,
Download failures for WP accessions annotated on large numbers of genomes are a known problem, and we are continuing to research ways to make this work.
For example, I tested individual downloads of each WP accession in the list you provided and I was able to successfully download most of them, while I saw reproducible failures with WP_003084404.1, which is annotated on ~28 k genome assemblies.
Let me ask you, for proteins such as WP_003084404.1, are you interested in downloading the genomic sequence from all ~28 k genomes on which this protein is annotated? If the answer is yes, then I can tell you that we will continue looking at ways to make this work. If not, we may be able to point you to easier ways to get a smaller set of genome sequences.
I look forward to hearing from you soon.
Best, Eric
Hi @ericcox1
Thank you so much for your effort. I'm only interested in obtaining the nucleotide and amino acid sequences of these proteins from Pseudomonas aeruginosa PA14, a hypervirulent strain (for more context, I will work with genome metabolic models). Carol
That's helpful information. Of the 50 WP accessions in your list, I found 4 that are annotated on PA14 genomes: WP_003088572.1, WP_003101261.1, WP_003109333.1, WP_003138346.1
Normally, I would suggest using this command to request genomic sequence specifically from the taxon of interest:
datasets download gene accession WP_003088572.1 --include gene,protein --taxon-filter 'Pseudomonas aeruginosa PA14' --filename PA14.zip
However, I found a bug where we are incorrectly reporting an error. While we investigate this bug, here is an alternative approach that uses curl against our API to get the data:
curl -o PA14-proteins.zip "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/protein/accession/WP_003088572.1%2CWP_003101261.1%2CWP_003109333.1%2CWP_003138346.1/download?include_annotation_type=FASTA_GENE&include_annotation_type=FASTA_PROTEIN&taxon=652611"
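For anyone scripting this workaround, here is a minimal Python sketch of how a URL of that shape could be assembled from a list of accessions. The endpoint path, the `include_annotation_type` values, and the `taxon` parameter are copied from the curl command above; `API_ROOT` and `protein_download_url` are hypothetical names introduced only for illustration, and the sketch does not perform the download itself.

```python
from urllib.parse import quote, urlencode

# Base path taken from the curl command in this thread.
API_ROOT = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha"


def protein_download_url(accessions, taxon,
                         annotation_types=("FASTA_GENE", "FASTA_PROTEIN")):
    """Build a protein-accession download URL (hypothetical helper).

    Accessions are joined with commas and percent-encoded, so the
    separator becomes %2C as in the curl example.
    """
    joined = quote(",".join(accessions), safe="")
    # A list of pairs lets the same parameter name repeat.
    params = [("include_annotation_type", t) for t in annotation_types]
    params.append(("taxon", str(taxon)))
    return f"{API_ROOT}/protein/accession/{joined}/download?{urlencode(params)}"


url = protein_download_url(
    ["WP_003088572.1", "WP_003101261.1", "WP_003109333.1", "WP_003138346.1"],
    taxon=652611,
)
print(url)
```

The printed URL can be passed to `curl -o PA14-proteins.zip "<url>"` in the same way as the command above.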
Here's a peek at what the FASTA headers will look like for the genomic sequence included in the package:
unzip -cq PA14-proteins.zip ncbi_dataset/data/gene.fna | grep '>' | head -5
>NZ_CP104980.1:c3445703-3444279 TIGR00366 family protein [protein_accession=WP_003088572.1] [organism=Pseudomonas aeruginosa PA14] [name=TIGR00366 family protein]
>NZ_CP104980.1:c912510-911707 hpaH [protein_accession=WP_003101261.1] [organism=Pseudomonas aeruginosa PA14] [name=2-oxo-hepta-3-ene-1,7-dioic acid hydratase] [gene=hpaH]
>NZ_CP104980.1:1482995-1483945 accA [protein_accession=WP_003109333.1] [organism=Pseudomonas aeruginosa PA14] [name=acetyl-CoA carboxylase carboxyl transferase subunit alpha] [gene=accA]
>NZ_CP104981.1:c3445703-3444279 TIGR00366 family protein [protein_accession=WP_003088572.1] [organism=Pseudomonas aeruginosa PA14] [name=TIGR00366 family protein]
>NZ_CP104981.1:c912510-911707 hpaH [protein_accession=WP_003101261.1] [organism=Pseudomonas aeruginosa PA14] [name=2-oxo-hepta-3-ene-1,7-dioic acid hydratase] [gene=hpaH]
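If these headers need to be processed downstream, the following Python sketch shows one way they could be split into fields. It assumes the layout seen in the five lines above (sequence accession, a coordinate range, a free-text label, then `[key=value]` tags) and it assumes that a leading `c` on the range marks the complement strand. `parse_header` and both regular expressions are hypothetical and not part of any NCBI tool.

```python
import re

# >SEQID:[c]START-END label [key=value] [key=value] ...
HEADER_RE = re.compile(
    r"^>(?P<seq_id>[^:\s]+):(?P<strand>c?)(?P<start>\d+)-(?P<end>\d+)\s+(?P<rest>.*)$"
)
TAG_RE = re.compile(r"\[(\w+)=([^\]]*)\]")


def parse_header(line):
    """Split one gene.fna header line into a dict (hypothetical helper)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    rest = match["rest"]
    record = {
        "seq_id": match["seq_id"],
        # Assumption: a 'c' prefix means the range is on the complement strand.
        "complement": match["strand"] == "c",
        "start": int(match["start"]),
        "end": int(match["end"]),
        # Free text before the first bracketed tag.
        "label": rest.split("[", 1)[0].strip(),
    }
    # Bracketed tags such as protein_accession, organism, name, gene.
    record.update(TAG_RE.findall(rest))
    return record


example = (
    ">NZ_CP104980.1:c912510-911707 hpaH [protein_accession=WP_003101261.1] "
    "[organism=Pseudomonas aeruginosa PA14] "
    "[name=2-oxo-hepta-3-ene-1,7-dioic acid hydratase] [gene=hpaH]"
)
print(parse_header(example))
```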
-Eric
Hi @ericcox1
Thank you for your help. I tried with curl and it worked! Please let us know when the datasets download for WP accessions is working again.
Carol
Checking in on the progress for this. I'm also having issues downloading WP accessions; specifically, many within the gammaproteobacteria subtree are failing with the zip archive error. I am getting some success on certain sets of accessions (I'm downloading hundreds of different sets), but it's rare. I've attached the debug output of one of the sets I am trying to download within a workflow. I'm not sure if it's a different issue, but when using the datasets Python API I get this error:
gene_ids_for_accessions = [int(gene_rec.gene.gene_id) for gene_rec in gene_reply.genes]
                               ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'gene_id'
This is using the function described here: https://www.ncbi.nlm.nih.gov/datasets/docs/v1/how-tos/genes/download-gene-data-package/
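The traceback indicates that at least one record in `gene_reply.genes` has `gene` set to `None`. A minimal sketch of a defensive version of that list comprehension is shown below. It does not address why the record is empty; it only skips such records and counts them. `gene_ids_from_reply` is a hypothetical helper, and the reply object is a `SimpleNamespace` stand-in with a made-up gene ID, not a real API response.

```python
from types import SimpleNamespace


def gene_ids_from_reply(gene_reply):
    """Collect integer gene IDs, skipping records whose .gene is None.

    Returns (ids, skipped) so the caller can see how many records
    carried no gene payload.
    """
    ids = []
    skipped = 0
    for gene_rec in gene_reply.genes or []:
        gene = getattr(gene_rec, "gene", None)
        if gene is None:
            skipped += 1
            continue
        ids.append(int(gene.gene_id))
    return ids, skipped


# Stand-in reply object; "12345" is a placeholder, not a real gene ID.
fake_reply = SimpleNamespace(
    genes=[
        SimpleNamespace(gene=SimpleNamespace(gene_id="12345")),
        SimpleNamespace(gene=None),
    ]
)
ids, skipped = gene_ids_from_reply(fake_reply)
print(ids, skipped)
```

Reporting the skipped count rather than silently dropping records makes it easier to tell which accessions returned no gene data.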
CLI output:
datasets download gene accession WP_042478116.1 WP_035893615.1 WP_010672261.1 WP_054661990.1 WP_024306120.1 WP_006660648.1 WP_003097132.1 WP_043192052.1 WP_000896506.1 WP_111259221.1 WP_067643804.1 WP_075184776.1 WP_009684663.1 WP_003367855.1 --include gene --filename outpath.zip --debug
2024/07/01 12:47:06
POST /datasets/v2alpha/protein/accession/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 296
Accept: text/plain,application/zip
Content-Type: application/json
Ncbi-Phid: 3E7681ACC0FC1E4745A072E1
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_042478116.1 WP_035893615.1 WP_010672261.1 WP_054661990.1 WP_024306120.1 WP_006660648.1 WP_003097132.1 WP_043192052.1 WP_000896506.1 WP_111259221.1 WP_067643804.1 WP_075184776.1 WP_009684663.1 WP_003367855.1 --include gene --filename outpath.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip
{"accessions":["WP_042478116.1","WP_035893615.1","WP_010672261.1","WP_054661990.1","WP_024306120.1","WP_006660648.1","WP_003097132.1","WP_043192052.1","WP_000896506.1","WP_111259221.1","WP_067643804.1","WP_075184776.1","WP_009684663.1","WP_003367855.1"],"include_annotation_type":["FASTA_GENE"]}
2024/07/01 12:47:11
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Mon, 01 Jul 2024 19:47:11 GMT
Grpc-Metadata-Logging-Accessions: WP_000896506.1,WP_003097132.1,WP_003367855.1,WP_006660648.1,WP_009684663.1,WP_010672261.1,WP_024306120.1,WP_035893615.1,WP_042478116.1,WP_043192052.1,WP_054661990.1,WP_067643804.1,WP_075184776.1,WP_111259221.1
Grpc-Metadata-Logging-Accessions_count: 14
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Include_annotation_type: FASTA_GENE
Grpc-Metadata-Logging-Service: prokaryote
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 3E7681ACC0FC1E4745A072E1.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block
New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Downloading: outpath.zip 934kB done
Validating package []
Error: Internal error (invalid zip archive). Please try again
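Within a workflow, it can help to check a downloaded package locally before later steps try to read it. The sketch below uses only Python's standard `zipfile` module and is not how the datasets client validates packages. `check_zip` is a hypothetical helper, and the check says nothing about why the server produced a bad archive.

```python
import io
import zipfile


def check_zip(path_or_file):
    """Return (ok, detail) for a zip given as a path or file-like object."""
    try:
        with zipfile.ZipFile(path_or_file) as zf:
            # testzip() reads every member and returns the first bad name.
            bad_member = zf.testzip()
            if bad_member is not None:
                return False, f"corrupt member: {bad_member}"
            return True, f"{len(zf.namelist())} member(s) readable"
    except zipfile.BadZipFile as exc:
        return False, f"not a valid zip archive: {exc}"


# Small in-memory demonstration with one valid and one invalid payload.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ncbi_dataset/data/gene.fna", ">example\nACGT\n")

print(check_zip(io.BytesIO(buffer.getvalue())))
print(check_zip(io.BytesIO(b"this is not a zip archive")))
```

In a pipeline, `check_zip("outpath.zip")` could gate a retry or flag the accession set for a later attempt.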
Hi @gabepen,
Thanks for your comment.
We haven't had a chance to look into this yet due to other institutional priorities.
Just to confirm, are you interested in downloading all underlying genomic sequences for each protein in your query? For example, for WP_003097132.1, this protein is annotated on close to 10 k genomes. Do you need the genomic sequences from each of the 10 k genomes?
Best, Eric
@ericcox1
It depends on each query. It doesn't seem to be a gene dataset size issue, though: I've tested the --taxon-filter option and still get the zip archive error for a single sequence.
I am also noticing that certain tax IDs return this error: "no download gene by accession data is currently available for this taxon". Is there a reason for this?
Thanks @gabepen.
> I've tested the --taxon-filter option and still get the zip archive error for a single sequence.
Good point. We are going to release a fix for this bug sometime next week.
Fixing the download errors for WP accessions annotated on many thousands of genomes is going to take some more research and we don't have a definite timeline for this yet.
> I am also noticing that certain tax IDs return this error: "no download gene by accession data is currently available for this taxon". Is there a reason for this?
Could you please share an example for this issue?
Best, Eric
@ericcox1
I thought I was only getting the error for taxa without a genome labeled as a reference, but I tested it with this taxon and received the same error:
datasets download gene accession 'WP_000818647.1' --include gene --taxon-filter 2774015 --debug
2024/07/05 14:38:23
POST /datasets/v2alpha/taxonomy/taxon_suggest HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 130
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip
{"exact_match":true,"tax_rank_filter":"higher_taxon","taxon_query":"2774015","taxon_resource_filter":"TAXON_RESOURCE_FILTER_ALL"}
2024/07/05 14:38:27
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:27 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block
2024/07/05 14:38:27
POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip
{"returned_content":"COMPLETE","taxons":["2774015"]}
2024/07/05 14:38:28
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:27 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block
2024/07/05 14:38:28
POST /datasets/v2alpha/taxonomy/taxon_suggest HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 153
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession WP_000818647.1 --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip
{"exact_match":true,"tax_rank_filter":"higher_taxon","taxon_query":"Pectobacterium quasiaquaticum","taxon_resource_filter":"TAXON_RESOURCE_FILTER_GENE"}
2024/07/05 14:38:28
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:38:28 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 70C0DA622ABF2A790F8B11B4.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block
Error: The taxonomy ID '2774015' is valid for 'Pectobacterium quasiaquaticum', but no download gene by accession data is currently available for this taxon.
And then I noticed I get a different error when the WP accessions are passed as a list:
datasets download gene accession ['WP_000818647.1'] --include gene --taxon-filter 2774015 --debug
2024/07/05 14:40:35
POST /datasets/v2alpha/gene HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/1.0.0/go
Content-Length: 146
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 115B94F10BCFA5CFB1656134
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download gene accession [WP_000818647.1] --include gene --taxon-filter 2774015 --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.20.0
Accept-Encoding: gzip
{"accessions":["[WP_000818647.1]"],"include_tabular_header":"INCLUDE_TABULAR_HEADER_FIRST_PAGE_ONLY","page_size":1,"returned_content":"IDS_ONLY"}
2024/07/05 14:40:39
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Fri, 05 Jul 2024 21:40:39 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 115B94F10BCFA5CFB1656134.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.22.1
X-Datasets-Version-Message: New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block
New version of client (16.22.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets.
Error: No genes found that match selection
If I pass a large enough list of accessions I will find some gene records for the taxid I tested above, but I'm confident that the single one tested is annotated on the reference genome.
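The request body in the second run shows that `[WP_000818647.1]`, brackets included, was sent as the accession, which would explain why that form matches no genes. When the command is generated from a Python list, one way to avoid this is to pass each accession as its own argument. The sketch below is a minimal illustration under that assumption. `build_download_cmd` is a hypothetical helper, and it uses only flags that already appear in this thread.

```python
import shlex


def build_download_cmd(accessions, taxon=None, include=("gene",),
                       filename=None, debug=False):
    """Build an argv list for `datasets download gene accession`.

    Each accession becomes a separate argument, so no list brackets
    or quotes end up inside the accession strings.
    """
    if isinstance(accessions, str):
        accessions = [accessions]
    cmd = ["datasets", "download", "gene", "accession", *accessions,
           "--include", ",".join(include)]
    if taxon is not None:
        cmd += ["--taxon-filter", str(taxon)]
    if filename:
        cmd += ["--filename", filename]
    if debug:
        cmd.append("--debug")
    return cmd


cmd = build_download_cmd(["WP_000818647.1"], taxon=2774015, debug=True)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)  (requires the datasets CLI)
```

Passing the list directly to `subprocess.run` avoids formatting a Python list into a shell string, which is where the literal brackets tend to come from.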
Hi @carolinasisco and @gabepen,
Thanks for your patience.
Here's an update:
For example, after updating to 16.23.0, this now works:
datasets download gene accession WP_003088572.1 --include gene,protein --taxon-filter 'Pseudomonas aeruginosa PA14' --filename PA14.zip
Downloading: PA14.zip 4.69kB valid zip structure -- files not checked
Validating package [================================================] 100% 6/6
Best, Eric
Hi @olearyna
I updated through conda --update; the version showing is 16.18.1. This is my command (I ran it with --debug as suggested):
datasets download gene accession --inputfile ~/Desktop/wp_1_50 --filename wp150 --include gene,protein --debug
The error is:
Error: Download error: http2: server sent GOAWAY and closed the connection; LastDownloading: ncbi_dataset.zip 4.62MB error
Find attached the screen capture with the phid.
Thanks!
Originally posted by @carolinasisco in https://github.com/ncbi/datasets/issues/360#issuecomment-2143635518