tarakc02 / ratelimitr

Rate limiting for R functions

Rate issue on HIBPwned #14

Closed: stephlocke closed this issue 7 years ago

stephlocke commented 7 years ago

A test started failing on HIBPwned that looks to be due to instability in the rate limiter. To replicate the issue I ran a microbenchmark evaluating the function 100 times with a warm-up of 1, which should mean every execution in the benchmark is rate-limited to 1.5 seconds or more. The results indicate that at least 25% of the evaluations are not adhering to the 1.5s rate limit:

> res
Unit: seconds
                                                        expr      min      lq     mean   median       uq      max neval
  ncol(account_breaches(acct, domain = "adobe.com")[[acct]]) 1.045447 1.49788 1.541082 1.533501 1.586741 2.042345   100

It could be that I'm not utilising ratelimitr in the best way - could you take a look and determine whether the problem is in my code or ratelimitr?

HIBPwned architecture

account_breaches

account_breaches<-function(
   accounts
  ,truncate=FALSE
  ,domain=NULL
  ,...){

  if(length(accounts)==0|!inherits(accounts,"character")) stop("Problematic accounts")
  if(length(truncate)!=1 |!inherits(truncate,"logical")  ) stop("Problematic truncate")
  if(!is.null(domain)&(
    length(domain)!=1  |!inherits(domain,"character"))  ) stop("Problematic domain")

  encoded<-urltools::url_encode(accounts)
  URLS<-paste0("https://haveibeenpwned.com/api/breachedaccount/"
               ,encoded)

  if(truncate) URLS<- urltools::param_set(URLS,"truncateResponse","true")
  if(!is.null(domain)) URLS<- urltools::param_set(URLS,"domain",urltools::url_encode(domain))

  res<-lapply(URLS, GETcontent, HIBP_headers(...))
  names(res)<-accounts

  return(res)

}

Underlying functions

The rate limit is implemented centrally, so all user-facing functions are governed by the same limit.

#' Used to extract content from a GET request and collapse to a data.frame
#'
#' @param ... Pass through arguments
#'
#' @return data.frame
GETcontent<-function(...){
  resp<-get(...)
  code<-httr::status_code(resp)
  content<-httr::content(resp ,as="text"
                         ,encoding="utf-8")
  if(code >200) res<-data.frame(name=NA)
  if(code==200) res<-jsonlite::fromJSON(content)
  return(res)
}

get <- ratelimitr::limit_rate(
  httr::GET,
  ratelimitr::rate(n = 1, period = 1.5)
)

MWE

if(!require(HIBPwned))
  install.packages("HIBPwned")
if(!require(microbenchmark))
  install.packages("microbenchmark")
library(HIBPwned)
library(microbenchmark)

acct<-"steff.sullivan@gmail.com"

my_check <- function(values) {
  all(sapply(values[-1], function(x)
    identical(values[[1]], x)))
}

res<-microbenchmark(
  ncol(account_breaches(acct, domain = "adobe.com")[[acct]]),
  times = 100,
  check = my_check,
  control = list(warmup = 1)
)

res
tarakc02 commented 7 years ago

Hi @stephlocke, thanks for reporting the problem and for the reproducible example. I ended up writing a long response that goes into a lot of detail, but the gist of it is that I believe the issue you're experiencing is related to #13, for which a fix is currently in devel and I'll be submitting an update to CRAN soon. My reasoning is below:


I did find a rate limit error recently that I've been looking into (#13), but it was occurring about 2-5 times out of every 1,000, not 1 out of every 4 (!). So I'm trying to figure out whether your failing test is a manifestation of #13 (in which case the version currently in the devel branch fixes the problem, and I'll be submitting the update to CRAN shortly) or something else. I forked HIBPwned and have run the tests several times without any failing tests, which also leads me to believe that the problem may be #13 (which crops up infrequently, so it would be hard to replicate by just running the tests a few times).

When I run your example, I'm also seeing around 20 of the 100 trials taking < 1.5s, as you are (way more than I'd expect based on #13). However, my_check never seems to fail -- is there a reason for this? I re-created the account_breaches function without the rate limiting and ran it twice in a row; the second result was in fact a 429 error, so I'd expect that if the rate limit were broken while running the example, my_check would fail.
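
(A minimal version of that manual check might look something like the sketch below; the test account is made up and the real package adds headers via HIBP_headers(), so this is only an approximation of what account_breaches does.)

library(httr)

# Sketch only: hypothetical account, plain GET, no rate limiting
url <- "https://haveibeenpwned.com/api/breachedaccount/test%40example.com"

r1 <- GET(url)
Sys.sleep(0.1)          # deliberately well under the 1.5s limit
r2 <- GET(url)

status_code(r1)         # e.g. 200, or 404 if the account has no breaches
status_code(r2)         # 429 once the rate limit has been broken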

I know microbenchmark uses the warmups to measure overhead and then subtracts the estimated overhead from execution times to try to get a "true" measure of a function's execution time. But ratelimitr doesn't need to ensure each function call takes > 1.5 seconds, only that > 1.5 seconds have elapsed between successive calls. That my_check isn't failing suggests to me that, at least when I'm running your example, the rate limit isn't actually being broken. If it isn't, then microbenchmark may not be measuring the right thing for our purposes. What we'd like to know for certain is that httr::GET is not being called more than once in 1.5 seconds. So I mocked up the following example, using Sys.sleep(runif(1)) as a stand-in for what's happening in the GET request; mock_account_breaches also records the exact time at which the fake GET request is about to be made.

mock_account_breaches <- function() {
    tmp <- microbenchmark::get_nanotime()  # timestamp just before the fake "GET"
    Sys.sleep(runif(1, min = 0, max = 1))  # stand-in for the network round trip
    tmp
}

mock_account_breaches <- ratelimitr::limit_rate(
    mock_account_breaches, 
    ratelimitr::rate(n = 1, period = 1.5)
)

Now to test:

iter <- 100
res <- replicate(iter, mock_account_breaches())

lagged_res <- c(rep(NA, 1), res[seq_len(iter - 1)])

# get_nanotime records time to the nanosecond
summary((res - lagged_res) / 1E9)

And what I see:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.529   1.533   1.533   1.533   1.533   1.538       1 

Which suggests to me that most of the time the rate limit is being obeyed in the real account_breaches function. Has the HIBPwned test been failing regularly, or has it only failed once or twice? If the latter, then I'm pretty sure that the forthcoming update (currently on the devel branch) will solve your issue.

stephlocke commented 7 years ago

It's only been failing recently ... on CRAN, and rerunning the tests on my Windows machine results in the same thing.

I copied my_check from the microbenchmark example, but I'm not sure it's doing what I think it should be doing.
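
(For what it's worth, a couple of toy calls with made-up values show what my_check actually evaluates: every result after the first must be identical() to the first. If it is only ever handed a single value, because the benchmark has just one expression, it can never return FALSE, since all() with nothing to check returns TRUE.)

my_check(list(26, 26, 26))  # TRUE  - every result matches the first
my_check(list(26, 26, 1))   # FALSE - one result differs (e.g. a 1-column 429 fallback)
my_check(list(26))          # TRUE  - vacuously: all() over an empty result is TRUE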

tarakc02 commented 7 years ago

I see. Well, your code looks like it ought to work, so in answer to your original question, the problem must be in ratelimitr. I'll continue working on this.

tarakc02 commented 7 years ago

OK, after updating my R installation to 3.4.0, I am able to consistently get the same test failure as the one appearing in the CRAN check. The upcoming patch to ratelimitr (currently in devel) does not resolve the issue. However, the haveibeenpwned API documentation says: "It's advisable to avoid querying the API at exactly the rate limit as network behaviour may result in some requests arriving within the retry period and causing a 429. Adding a 100 millisecond delay between requests will usually ensure this won't happen." Based on that, I changed the definition of HIBPwned:::get to use ratelimitr::rate(n = 1, period = 1.6) (rather than period = 1.5), and now the tests are passing!

I'll still go ahead and submit the patch based on #13 to CRAN, but for this issue your best bet is to update HIBPwned:::get to:

get <- ratelimitr::limit_rate(
  httr::GET,
  ratelimitr::rate(n = 1, period = 1.6)
)

That seems to fix the problem, with or without the forthcoming updates to ratelimitr.

stephlocke commented 7 years ago

I upped it and submitted it to CRAN but I think I'm also missing some vital concept about ratelimitr.

I expect that if I set a rate limit, R will not emit a request until the rate limit period expires, i.e. R will wait 1.5s after the last request completed before sending another one. With network latency, that means I'd usually expect each request to hit approximately 1.51s after the previous one.

From what you're saying about the 3.4.0 upgrade, it sounds like our code worked before then perhaps because R added a bit of extra computation/processing time, so things took that little bit longer and therefore didn't error as often. Perhaps the rate limit is still being respected now, but because of the recent R optimisations it's all executing a little quicker?

tarakc02 commented 7 years ago

Yeah, I've been trying to figure out what about 3.4.0 could cause the change; your theory makes sense to me (though I'm unsure how to test it). Here's a rough outline of the way ratelimitr currently works, at least in the case of HIBPwned. Say you use the account_breaches function:

  1. ratelimitr looks up the last timestamp it recorded. If it has been < 1.5 (or 1.6) seconds, it calls Sys.sleep for the appropriate amount of time.
  2. ratelimitr then prepares for the GET request by recording the current timestamp
  3. send GET request

What you are suggesting, which I think makes sense, is that it should be like this instead:

  1. look at last timestamp, sleep if necessary
  2. send GET request
  3. after results are returned, record the current timestamp

That would better respect the fact that time passes between sending the GET request and the HIBPwned server receiving said request. When I've been developing and testing the package, my goal has been to ensure that the request is not emitted more than once per 1.5 seconds, but you've helped me realize the goal should be to make sure that the request is not received more than once per 1.5 seconds.

You can imagine an extreme scenario, where due to a network issue a GET request takes a full 1.5 seconds to reach the HIBPwned server (but the results take .1 seconds to return). By the time the function completes execution, ratelimitr believes that it is free to send another request immediately, but the server won't accept another request from you for another 1.4 seconds. If the next request you send only takes .1 seconds to get there, you've broken the rate limit (from the HIBPwned server's point of view, which is the only one that matters) even though ratelimitr worked as designed.

I'll work on changing the order of when the timestamp is recorded, to work like the second outline above. It seems like a simple change, but will require some reorganizing.
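
(To make the change concrete, a stripped-down sketch of the "record after" idea could look like the following. This is an illustration only, not ratelimitr's actual internals: a single rate, no token-bucket bookkeeping, and the limit_after name is made up.)

limit_after <- function(f, period) {
  last <- -Inf                              # time the last call *finished*
  function(...) {
    wait <- period - (as.numeric(Sys.time()) - last)
    if (wait > 0) Sys.sleep(wait)           # 1. sleep if the last call was too recent
    res <- f(...)                           # 2. send the request
    last <<- as.numeric(Sys.time())         # 3. record the time AFTER it returns
    res
  }
}

# e.g. slow_get <- limit_after(httr::GET, period = 1.6)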

tarakc02 commented 7 years ago

I wrote some unit tests that mimic the "extreme scenario" outlined above, and updated the package to have the rate limiter record timestamps after a function executes. The new version should be much more robust for use with web APIs. I'll be submitting the update to CRAN in the next day or two. Also, with the new updates, the MWE that you wrote above works out to:

     min       lq     mean   median      uq     max neval
1.600398 1.708108 1.719228 1.718625 1.73168 1.92302   100