ropensci / crul

R6 based http client for R (for developers)
https://docs.ropensci.org/crul

How do you avoid exceeding the rate limit when making API calls? #156

Closed moldach closed 3 years ago

moldach commented 3 years ago

I'm getting a rate-limit error when attempting to make 2000 API calls to https://api.targetsafety.info/

library(crul)
library(readr)
library(jsonlite)

# Read the list of 2000 URLs for the API calls
# (the list itself is omitted because it contains a private key)
tmp <- read_csv("testURLs.txt")

(cc <- Async$new(
  urls = unlist(tmp)
))

(res <- cc$get())

# Look at one of the later API calls to see what status comes back

jsonlite::fromJSON(res[[1800]]$parse())

$status
[1] 429

$error
[1] "Too many requests, Please try again in 5 minutes."

How does one limit the number of concurrent API calls to avoid this?

sckott commented 3 years ago

Thanks for the issue @moldach

will have a look

sckott commented 3 years ago

@moldach Have you tried AsyncQueue yet? https://docs.ropensci.org/crul/reference/AsyncQueue.html
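To illustrate the idea behind the `bucket_size` and `sleep` arguments described on that page, here is a minimal base-R sketch of batching: send a fixed number of requests, pause, then send the next batch. It is not crul's implementation. `make_buckets()` and the placeholder URLs are hypothetical, and the actual HTTP call is left as a comment so the snippet runs without network access.

```r
# Hypothetical helper: split a vector of URLs into batches ("buckets")
# of at most `bucket_size` elements each.
make_buckets <- function(urls, bucket_size = 5) {
  split(urls, ceiling(seq_along(urls) / bucket_size))
}

# Placeholder URLs (assumption: stand-ins for the real list of API calls)
urls <- sprintf("https://httpbin.org/get?i=%d", 1:12)
buckets <- make_buckets(urls, bucket_size = 5)

# Number of URLs in each bucket
bucket_sizes <- unname(lengths(buckets))
bucket_sizes

# Conceptual loop: handle one bucket, then pause before the next one.
pause_seconds <- 0.1  # assumption: a real rate limit would need a longer pause
for (i in seq_along(buckets)) {
  # The requests for buckets[[i]] would be sent here,
  # e.g. with crul::Async$new(urls = buckets[[i]])$get()
  if (i < length(buckets)) Sys.sleep(pause_seconds)
}
```

With an unknown rate limit, a small bucket and a generous pause are the safer starting point; both can be relaxed once no 429 responses come back.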

moldach commented 3 years ago

Thanks for pointing me in the right direction @sckott

I'm still a bit confused as the documentation for simple async mentions how to parse the HttpClient results, e.g.:

(cc <- Async$new(
  urls = c(
    'https://httpbin.org/get?a=5',
    'https://httpbin.org/get?a=5&b=6',
    'https://httpbin.org/ip'
  )
))
(res <- cc$get())
res[[1]]$parse("UTF-8")

However, the same doesn't work for AsyncQueue():

reqlist <- list(
  HttpRequest$new(url = "https://httpbin.org/get")$get(),
  HttpRequest$new(url = "https://ropensci.org/blog")$get(),
  HttpRequest$new(url = "https://ropensci.org/careers")$get()
)
out <- AsyncQueue$new(.list = reqlist, bucket_size = 5, sleep = 3)
out
out$request() # make requests
out$responses() # list responses
out$parse()  ### Returns character(0)

character(0)

Or this:

out[[1]]$parse("UTF-8")

Error in out[[1]] : wrong arguments for subsetting an environment

P.S. Also, how does one query a specific response, say the 10,000th of 500,000 URLs? out$responses() tries to print everything to the console and is therefore limited by max.print() in RStudio.

sckott commented 3 years ago

The usage for AsyncQueue is a bit different from Async because of the nature of what it takes to do HTTP requests from a queue. With AsyncQueue the responses go into a bucket (essentially a list), and instead of returning a new object, you use the same object to access the results (responses).

Until it's fixed, you can iterate through the responses from AsyncQueue with lapply or similar:

lapply(out$responses(), function(x) x$parse())

Also, how does one query a specific response, say 10,000th of 500,000 urls...

What do you mean by "query a specific response"?

I'll consider adding an S3 print method for the output of $responses() so that you don't have to deal with a huge dump of text to the console ...

moldach commented 3 years ago

What do you mean by "query a specific response"?

I need to make 372,059 API calls and unfortunately the documentation (for the API I am querying) doesn't specify what the rate-limit is; therefore I need to try and find it by trial-and-error.

Ultimately, I would like to be able to check for status=429 errors:

Too many requests, Please try again in 5 minutes.

So for example, it would be nice to check a particular response() for API call #10,000 to see if I'm making too many calls.
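One way to do this kind of check, once all requests have finished, is to pull the status code out of every response and look up the positions that came back as 429. The sketch below is hedged: `find_rate_limited()` is a hypothetical helper, the mock list stands in for `out$responses()`, and it assumes the API sets the HTTP status code itself to 429 (crul response objects expose it as `status_code`). If the API only reports the status inside the JSON body, the body would have to be parsed instead.

```r
# Hypothetical helper: positions of responses whose status code equals `code`.
# Works on any list whose elements have a `status_code` element or field.
find_rate_limited <- function(responses, code = 429L) {
  statuses <- vapply(
    responses,
    function(x) as.integer(x$status_code),
    integer(1)
  )
  which(statuses == code)
}

# Mock responses (assumption: stand-ins for the list from out$responses())
mock_responses <- list(
  list(status_code = 200),
  list(status_code = 429),
  list(status_code = 200),
  list(status_code = 429)
)

# Which calls were rate limited?
hits <- find_rate_limited(mock_responses)
hits

# Inspect one specific response by position without printing the whole list
second_status <- mock_responses[[2]]$status_code
second_status
```

Assigning the list to a variable and indexing it, rather than printing `out$responses()` directly, also sidesteps the max.print() limit.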

sckott commented 3 years ago

Right now AsyncQueue is configured to be blocking. That is, when you run $request() you have to wait for it to finish all requests before doing anything else in that R console. We could change the class so it's not blocking, and then you could inquire about the status of requests, etc. There are tradeoffs to this, since you then have to check when all requests are complete, etc. But that would be a significant change and would take some time to do. I've opened an issue for that.

sckott commented 3 years ago

@moldach your example above for AsyncQueue should work now

moldach commented 3 years ago

I tried updating to the latest version with remotes::install_github("ropensci/crul"), but I'm still getting the same error for the example above:

reqlist <- list(
  HttpRequest$new(url = "https://httpbin.org/get")$get(),
  HttpRequest$new(url = "https://ropensci.org/blog")$get(),
  HttpRequest$new(url = "https://ropensci.org/careers")$get()
)
out <- AsyncQueue$new(.list = reqlist, bucket_size = 5, sleep = 3)
out
out$request() # make requests
out$responses() # list responses
out$parse()  ### Returns character(0)
out[[1]]$responses() ### Returns Error in out[[1]] : wrong arguments for subsetting an environment

sckott commented 3 years ago

Did you install from this GitHub repository, i.e. with remotes::install_github("ropensci/crul")? If so, make sure to restart R before trying again. It's working for me right now. Also, out[[1]]$responses() should be out$responses()[[1]].
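The reason the order matters: R6 objects such as an AsyncQueue instance are environments, and an environment cannot be indexed by position, whereas the list returned by $responses() can. A small base-R sketch of the difference, using a plain environment as a hypothetical stand-in for the queue object (`queue_like` and its contents are made up for illustration):

```r
# Hypothetical stand-in for an R6 object: a plain environment
# with a responses() function that returns a list.
queue_like <- new.env()
queue_like$responses <- function() list("first response", "second response")

# Indexing the environment itself by position fails;
# capture the error message instead of stopping.
result <- tryCatch(
  queue_like[[1]],
  error = function(e) conditionMessage(e)
)
result

# Calling the method first and then indexing the returned list works.
first <- queue_like$responses()[[1]]
first
```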

moldach commented 3 years ago

out[[1]]$responses() should be out$responses()[[1]]

Changing this fixed it.

Thanks! ❤️

sckott commented 3 years ago

great!