ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

API rate limit exceeded #856

Closed nick-youngblut closed 3 years ago

nick-youngblut commented 3 years ago

I was just re-running code that I previously ran a couple of months ago without problems, and now I'm getting the following error:

Error: '{"error":"API rate limit exceeded","api-key":"XXXXXXXXXXXXXXXXXXXXX","count":"11","limit":"10"}
' does not exist in current working directory ('/ebio/abt3_projects/my_project/notebooks/').
Traceback:

1. classification(vector_of_my_species, db = "ncbi", batch_size = 5)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. classification(., db = "ncbi", batch_size = 5)
10. classification.default(., db = "ncbi", batch_size = 5)
11. process_ids(sci_id, db, get_uid, rows = rows)
12. eval(fxn)(input, ...)
13. xml2::read_xml(raw_xml_result)
14. read_xml.character(raw_xml_result)
15. path_to_connection(x)
16. check_path(path)
17. stop("'", path, "' does not exist", if (!is_absolute_path(path)) paste0(" in current working directory ('", 
  .     getwd(), "')"), ".", call. = FALSE)

The code being run is:

 classification(vector_of_my_species, db = 'ncbi', batch_size=5)

Changing the batch size doesn't help.

My ~/.Renviron contains ENTREZ_KEY='XXXXXXXXXXXXXXXXXXXXXX'
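
The key does appear to be getting picked up, since the error echoes it back. As a sanity check, verifying that the R session actually sees it is just base R:

Sys.getenv("ENTREZ_KEY")  # should return the key if ~/.Renviron was loaded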

SessionInfo

R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/my_project/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] LeyLabRMisc_0.1.6 taxize_0.9.97     readxl_1.3.1      ape_5.3          
[5] phyloseq_1.30.0   ggplot2_3.2.1     tidyr_1.0.0       dplyr_0.8.5      

loaded via a namespace (and not attached):
 [1] Biobase_2.46.0      jsonlite_1.6        splines_3.6.2      
 [4] foreach_1.4.7       bold_1.1.0          assertthat_0.2.1   
 [7] triebeard_0.3.0     urltools_1.7.3      stats4_3.6.2       
[10] cellranger_1.1.0    pillar_1.4.3        lattice_0.20-38    
[13] glue_1.3.1          uuid_0.1-2          digest_0.6.23      
[16] XVector_0.26.0      colorspace_1.4-1    htmltools_0.4.0    
[19] Matrix_1.2-18       plyr_1.8.5          pkgconfig_2.0.3    
[22] httpcode_0.3.0      zlibbioc_1.32.0     purrr_0.3.3        
[25] scales_1.1.0        tibble_2.1.3        mgcv_1.8-31        
[28] farver_2.0.2        IRanges_2.20.0      ellipsis_0.3.0     
[31] withr_2.1.2         repr_1.0.2          BiocGenerics_0.32.0
[34] lazyeval_0.2.2      cli_2.0.1           survival_3.1-8     
[37] magrittr_1.5        crayon_1.3.4        evaluate_0.14      
[40] fansi_0.4.1         nlme_3.1-143        MASS_7.3-51.5      
[43] xml2_1.2.2          vegan_2.5-6         tools_3.6.2        
[46] data.table_1.12.8   lifecycle_0.1.0     stringr_1.4.0      
[49] Rhdf5lib_1.8.0      S4Vectors_0.24.0    munsell_0.5.0      
[52] cluster_2.1.0       Biostrings_2.54.0   ade4_1.7-13        
[55] compiler_3.6.2      rlang_0.4.6         conditionz_0.1.0   
[58] rhdf5_2.30.0        grid_3.6.2          pbdZMQ_0.3-3       
[61] iterators_1.0.12    IRkernel_1.1        biomformat_1.14.0  
[64] igraph_1.2.4.2      labeling_0.3        base64enc_0.1-3    
[67] gtable_0.3.0        codetools_0.2-16    multtest_2.42.0    
[70] reshape_0.8.8       curl_4.3            reshape2_1.4.3     
[73] R6_2.4.1            zoo_1.8-8           permute_0.9-5      
[76] stringi_1.4.5       parallel_3.6.2      crul_1.0.0         
[79] IRdisplay_0.7.0     Rcpp_1.0.3          vctrs_0.3.0        
[82] tidyselect_1.1.0 

I don't understand the "does not exist in current working directory" part of the message. I tried copying my .Renviron to that directory, but that didn't help.

nick-youngblut commented 3 years ago

Updating to taxize 0.9.99 did not fix the issue; I get the same error.

nick-youngblut commented 3 years ago

A reprex:

library(taxize)
names <- species_plantarum_binomials[1:500,]
names$species <- paste(names$genus, names$epithet)
classification(names$species, db = 'ncbi', batch_size=5)

The following output is generated:

══  500 queries  ═════════════

Retrieving data for taxon 'Acalypha australis'

✔  Found:  Acalypha+australis

Retrieving data for taxon 'Acalypha indica'

✔  Found:  Acalypha+indica

Retrieving data for taxon 'Acalypha virginica'

✔  Found:  Acalypha+virginica

Retrieving data for taxon 'Acanthus ilicifolius'

✔  Found:  Acanthus+ilicifolius

Retrieving data for taxon 'Acanthus maderaspatensis'

Not found. Consider checking the spelling or alternate classification

Retrieving data for taxon 'Acanthus mollis'

✔  Found:  Acanthus+mollis

Retrieving data for taxon 'Acanthus spinosus'

✔  Found:  Acanthus+spinosus

Retrieving data for taxon 'Acer campestre'

✔  Found:  Acer+campestre

Retrieving data for taxon 'Acer monspessulanum'

✔  Found:  Acer+monspessulanum

Retrieving data for taxon 'Acer negundo'

✔  Found:  Acer+negundo

Retrieving data for taxon 'Acer pensylvanicum'

✔  Found:  Acer+pensylvanicum

Retrieving data for taxon 'Acer platanoides'

✔  Found:  Acer+platanoides

Retrieving data for taxon 'Acer pseudo-platanus'

✔  Found:  Acer+pseudo-platanus

Retrieving data for taxon 'Acer rubrum'

✔  Found:  Acer+rubrum

Retrieving data for taxon 'Acer saccharinum'

✔  Found:  Acer+saccharinum

Retrieving data for taxon 'Acer tataricum'

✔  Found:  Acer+tataricum

Retrieving data for taxon 'Achillea abrotanifolia'

Not found. Consider checking the spelling or alternate classification

Retrieving data for taxon 'Achillea aegyptiaca'

Not found. Consider checking the spelling or alternate classification

Retrieving data for taxon 'Achillea ageratum'

✔  Found:  Achillea+ageratum

Retrieving data for taxon 'Achillea alpina'

✔  Found:  Achillea+alpina

Retrieving data for taxon 'Achillea atrata'

✔  Found:  Achillea+atrata

Retrieving data for taxon 'Achillea bipinnata'

Not found. Consider checking the spelling or alternate classification

Retrieving data for taxon 'Achillea clavennae'

✔  Found:  Achillea+clavennae

Retrieving data for taxon 'Achillea cretica'

✔  Found:  Achillea+cretica

Retrieving data for taxon 'Achillea falcata'

Not found. Consider checking the spelling or alternate classification

Retrieving data for taxon 'Achillea impatiens'

✔  Found:  Achillea+impatiens

Retrieving data for taxon 'Achillea inodora'

Error: '{"error":"API rate limit exceeded","api-key":"XXXXXXXXXXXXXXXXXXX","count":"11","limit":"10"}
' does not exist in current working directory ('/ebio/abt3_projects/my_project/').
sckott commented 3 years ago

Thanks for the issue.

The error is happening, as you can see in your traceback above, at the XML reading step. When NCBI rate-limits us we get back a plain JSON error string rather than XML; since the xml code can tell the string isn't XML, it falls back to treating it as a path to a file, and that file doesn't exist, hence that error.
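
To illustrate (just a sketch of the xml2 behaviour, not taxize code), feeding read_xml() a plain non-XML string makes it fall back to treating the string as a file path, which is where the "does not exist in current working directory" message comes from:

library(xml2)
# NCBI's rate-limit response is plain JSON, not XML, so read_xml()
# treats it as a path to a (non-existent) file and errors
read_xml('{"error":"API rate limit exceeded","count":"11","limit":"10"}')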

I've fixed the above so it fails more informatively, with just the rate limit message. You can reinstall from here to get the update.

sckott commented 3 years ago

The rate limiting you are hitting is in get_uid. If you already had NCBI taxon ids you could pass those to classification and take advantage of batch querying to avoid rate limits, but since you are passing in names, we first have to call get_uid to look up the ids, one name at a time. One workaround with the package as it is:

names <- species_plantarum_binomials[1:500,]
names$species <- paste(names$genus, names$epithet)
res = lapply(names$species[1:100], function(w) {
  Sys.sleep(1) # sleep for a second, possibly less to avoid rate limit
  get_uid(w)
})
res <- as.uid(res, check = FALSE) # don't check that ids are valid, much faster
classification(res, db = 'ncbi', batch_size=10)
sckott commented 3 years ago

I may see if we can add a sleep setting in get_uid.

nick-youngblut commented 3 years ago

Thanks for making the quick update!

I got around the issue earlier today by doing the batching myself:

chunk = function(x, n=4){
    # split x into n roughly equal-sized chunks
    split(x, factor(1:length(x) %% n))
}

.classification = function(x, db, batch){
    res = classification(x, db = db, batch_size = batch)
    Sys.sleep(8)   # long pause between chunks to avoid the rate limit
    return(res)
}

classification_batch = function(x, db = 'ncbi', batch=8){
    # requires magrittr/dplyr for %>%
    cls = chunk(x, as.integer(length(x) / batch)) %>%
        lapply(.classification, db = db, batch = batch) %>%
        do.call(c, .)
    return(cls)
}

I had to include a long sleep (8 sec) between batches to avoid an HTTP request error, presumably because bursts of requests were still pushing past the rate limit.

sckott commented 3 years ago

Glad you got a solution.

sckott commented 3 years ago

Just pushed another change; reinstall to get it. You can now adjust the NCBI sleep time between requests:

taxize_options(ncbi_sleep = 0.5) # just an example; adjust this time as needed
names <- species_plantarum_binomials[1:500,]
names$species <- paste(names$genus, names$epithet)
classification(names$species[1:100], db = 'ncbi', batch_size=5)
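
For context, the limit in the error above is 10 requests, which matches NCBI's per-second cap for requests made with an API key, so a sleep of a few tenths of a second per request should keep get_uid comfortably under it.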