ropensci / taxa

taxonomic classes for R
https://docs.ropensci.org/taxa

HTTP errors when parsing long taxon_id list #202

Open janstrauss1 opened 4 years ago

janstrauss1 commented 4 years ago

Hi there,

I'm trying to create a taxmap from a long list of NCBI taxon IDs for subsequent filtering.

I have downloaded about 17k taxa containing a specific protein domain from InterPro and imported them into R:

> my.tax_id <- read.table(file = "TaxID_IPR012674.txt")
> str(my.tax_id)
'data.frame':   17482 obs. of  1 variable:
 $ V1: int  104 158 162 166 17 172 192 195 196 197

I then try to set up my taxmap as follows:

my.taxmap <- lookup_tax_data(
  tax_data = my.tax_id, 
  type = "taxon_id", 
  column = 1, 
  datasets = list(),
  mappings = c(), 
  database = "ncbi", 
  include_tax_data = TRUE,
  use_database_ids = TRUE, 
  ask = TRUE
  )
Looking up classifications for 17482 unique taxon IDs from database "ncbi"...

Unfortunately, this fails with the error: Error: Too Many Requests (HTTP 429)

I guess the API client is making too many concurrent requests to the database which causes the error.

Could you please help to fix it?

Many thanks in advance!

The output of sessionInfo() is

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] urltools_1.7.3 taxize_0.9.91  taxa_0.3.2    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        pillar_1.4.2      compiler_3.6.1    plyr_1.8.4        iterators_1.0.12  tools_3.6.1      
 [7] jsonlite_1.6      tibble_2.1.3      nlme_3.1-141      lattice_0.20-38   pkgconfig_2.0.3   rlang_0.4.1      
[13] foreach_1.4.7     cli_1.1.0         rstudioapi_0.10   crul_0.9.0        curl_4.2          parallel_3.6.1   
[19] dplyr_0.8.3       stringr_1.4.0     xml2_1.2.2        triebeard_0.3.0   grid_3.6.1        tidyselect_0.2.5 
[25] reshape_0.8.8     glue_1.3.1        httpcode_0.2.0    data.table_1.12.6 R6_2.4.1          reshape2_1.4.3   
[31] purrr_0.3.3       magrittr_1.5      codetools_0.2-16  assertthat_0.2.1  bold_0.9.0        ape_5.3          
[37] stringi_1.4.3     crayon_1.3.4      zoo_1.8-6  
janstrauss1 commented 4 years ago

There seems to be a related issue for the taxize package: https://github.com/ropensci/taxize/issues/785#issuecomment-554462753

sckott commented 4 years ago

Are you definitely using NCBI? The data source in question in that taxize issue 785 is for Catalogue of Life, not NCBI. Anyway, NCBI may also throw 429 errors. Do you have an NCBI ENTREZ API key set with the env var ENTREZ_KEY ?

janstrauss1 commented 4 years ago

@sckott, yes, I'm definitely using NCBI taxon IDs. No, I did not set an ENTREZ_KEY, but I think this might solve the problem. I have already obtained an NCBI API key, but how do I set it correctly?

Many thanks in advance for your help!

janstrauss1 commented 4 years ago

@sckott, I just set the key using Sys.setenv(ENTREZ_KEY = "my.api.key") as you outlined at https://github.com/ropensci/taxa/issues/135#issuecomment-370862861. It only partially solves my issue: the download stalled at 7% with the error: Error: Bad Request (HTTP 400). Any idea how to address this?
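For reference, a minimal sketch of setting the key and checking that R can see it before starting a long lookup. The key value is a placeholder, and the check uses only base R; it confirms that the variable is set, not that NCBI accepts the key.

```r
# Set the key for the current R session only (placeholder value, not a real key).
Sys.setenv(ENTREZ_KEY = "your-ncbi-api-key")

# Confirm that the variable is visible to R before starting a long lookup.
key_is_set <- nzchar(Sys.getenv("ENTREZ_KEY"))
key_is_set
```

To keep the key across sessions, the same variable can instead be defined in the user's .Renviron file as a line of the form ENTREZ_KEY=your-ncbi-api-key, followed by a restart of R.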

janstrauss1 commented 4 years ago

It appears that downloading the classifications for such a long list of taxon IDs from NCBI is very fragile. After setting my NCBI API key and re-running my script as outlined above, the download now stalled at 25% with the error: Bad Gateway (HTTP 502).

janstrauss1 commented 4 years ago

It eventually worked to download the classifications of the full 17k list of NCBI taxon IDs.
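One way to make a long lookup like this less fragile might be to send the IDs in smaller batches with a pause in between, so that a failure only costs one batch. The sketch below uses only base R; split_into_batches and lookup_in_batches are hypothetical helper names, and the batch size and pause are arbitrary assumptions rather than documented NCBI limits.

```r
# Split a vector of IDs into batches of at most `batch_size` elements.
split_into_batches <- function(ids, batch_size = 500) {
  stopifnot(batch_size >= 1)
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

# Apply `lookup_fun` to each batch of unique IDs, pausing between batches.
# `lookup_fun` is any function that takes a vector of IDs (an assumption:
# for example a wrapper around a taxize or taxa lookup call).
lookup_in_batches <- function(ids, lookup_fun, batch_size = 500, pause = 1) {
  batches <- split_into_batches(unique(ids), batch_size)
  results <- vector("list", length(batches))
  for (b in seq_along(batches)) {
    results[[b]] <- lookup_fun(batches[[b]])
    if (b < length(batches)) Sys.sleep(pause)  # be gentle with the remote API
  }
  results
}

# Hypothetical usage (needs network access, so it is not run here):
# res <- lookup_in_batches(my.tax_id$V1,
#                          function(x) taxize::classification(x, db = "ncbi"))
```

Each element of the returned list holds the result for one batch; how to combine them depends on what the lookup function returns.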

sckott commented 4 years ago

NCBI's infrastructure is not very good, so I'm not surprised that you are running into errors with a lot of names.

Another option is taxizedb. The idea is the same as taxize, but it uses SQL dumps on your local machine instead of remote API calls.
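A rough sketch of what the local route could look like, assuming the taxizedb package is installed. classify_locally is a hypothetical wrapper name; the first call downloads the NCBI taxonomy dump, which takes time and disk space, and nothing here has been tuned for a 17k-ID list.

```r
# Sketch: classify NCBI taxon IDs against a local copy of the NCBI taxonomy.
# Assumes taxizedb is installed; nothing is downloaded until the function is called.
classify_locally <- function(ids) {
  if (!requireNamespace("taxizedb", quietly = TRUE)) {
    stop("Package 'taxizedb' is required for local lookups.")
  }
  # Download the NCBI taxonomy dump and build the local database if needed.
  taxizedb::db_download_ncbi()
  # Look up the full classification of each ID in the local database.
  taxizedb::classification(ids, db = "ncbi")
}

# Hypothetical usage, with IDs taken from the original post:
# cls <- classify_locally(c(104, 158, 162))
```

Because the lookups run against a local database, HTTP 429, 400, and 502 responses from NCBI should no longer be a factor once the download has completed.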

ErwinFeringa commented 11 months ago

I have been running into an issue for some time now trying to parse my data with lookup_tax_data. I have around 4k tax IDs and I want to visualize them, together with their fraction of total reads, in a heat tree.

This is what I run:

Sys.setenv(ENTREZ_KEY = "my key")
data15 <- read.delim("path to my file")
taxed_15 <- lookup_tax_data(
  data15,
  "taxon_id",
  column = 2,
  datasets = list("fraction_total_reads"),
  mappings = c("value"),
  database = "ncbi",
  include_tax_data = TRUE,
  use_database_ids = TRUE,
  ask = TRUE
)

I get one of the following errors: Error: Bad Request (HTTP 400) or: Error in get_sort_var(tax_data, names(sort_var)) : No column named ""."

The last error does not show up if I leave out "datasets" and "mappings".

I hope there is a way to solve the problems I am facing.
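A possible explanation for the second error, offered as an assumption rather than a confirmed diagnosis: the taxa documentation describes mappings as a named character vector, and an unnamed vector such as c("value") has no names for the function to look up. The base R snippet below only illustrates that difference; the "{{index}}" entry is shown as an example of a named mapping and should be checked against the lookup_tax_data help page.

```r
# An unnamed character vector carries no names at all.
unnamed <- c("value")
names(unnamed)

# A named character vector pairs each name with a value.
named <- c("{{index}}" = "{{index}}")
names(named)

# With include_tax_data = TRUE, leaving out `datasets` and `mappings` may be
# enough (hypothetical call, needs network access, not run here):
# taxed_15 <- lookup_tax_data(data15, type = "taxon_id", column = 2,
#                             database = "ncbi", include_tax_data = TRUE)
```

This matches the observation that the error goes away when "datasets" and "mappings" are omitted. The HTTP 400 error is a separate, network-side problem.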

RJGrayEcology commented 10 months ago

Is this still not solved? I have the same problem with a list of about 600 species.

zachary-foster commented 10 months ago

Are these errors random, or the same every time? If the latter, can you give me a command to test that causes this error?

morellek commented 8 months ago

I had the same error. What did the trick for me was to wrap the query in try() and check for a "try-error" result: if the Error: Bad Request (HTTP 400) message appeared, I called Sys.sleep() and then retried the query. In a loop, it looks like this:

for (i in 1:nrow(data)) {
  # First attempt; try() returns a "try-error" object instead of stopping the loop
  classes_i <- try(tax_name(sci = data$taxon[i], get = c("genus", "family", "order", "class"), db = "ncbi"))
  if (inherits(classes_i, "try-error")) {
    # Wait 10 seconds, then retry the same query once
    Sys.sleep(10)
    classes_i <- try(tax_name(sci = data$taxon[i], get = c("genus", "family", "order", "class"), db = "ncbi"))
  }
  classes_both <- rbind(classes_both, classes_i)
}

stephanJG commented 8 months ago

Thanks morellek, the loop worked for me. I was getting frustrated that, even after getting the NCBI API key and using Sys.sleep in my similar loop, I still got the Error: Bad Request (HTTP 400) message. I still get some rows filled with the error message, but that I can fix.

PS: classes_both = NULL before the loop is missing
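The same retry idea can be sketched as a small reusable helper that retries a few times with an increasing wait and raises an error when every attempt fails, so no error rows end up in the result. with_retry and collect_rows are hypothetical names, the defaults are arbitrary, and only base R is used; the tax_name() call in the usage comment is the one from the loop above.

```r
# Call fun(...) up to `max_tries` times, waiting longer after each failure.
with_retry <- function(fun, ..., max_tries = 3, wait = 10) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_tries) Sys.sleep(wait * attempt)  # linear backoff
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Run the query once per key and bind the rows at the end, which avoids
# growing a data frame inside the loop and needs no `classes_both <- NULL`.
collect_rows <- function(keys, fun, ...) {
  rows <- lapply(keys, function(k) with_retry(fun, k, ...))
  do.call(rbind, rows)
}

# Hypothetical usage (needs taxize and network access, not run here):
# classes_both <- collect_rows(data$taxon, function(s) {
#   tax_name(sci = s, get = c("genus", "family", "order", "class"), db = "ncbi")
# })
```

If a single bad name should not abort the whole run, the call to with_retry inside collect_rows could itself be wrapped in tryCatch to record the failure and continue.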