closing per my comment on #206
Spoke too soon... this is still a problem. I have a ticket open with Qualtrics support. I don't think it's a rate-limit issue; the limit is 3,000 requests per minute. Per support, "504 errors usually indicate timing out and this could be due to the size of the API request."
Hence, a try/except wrapper on the call, and/or some spacing out of the calls, might still help.
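For illustration only, "spacing out the calls" can be as simple as pausing between paginated requests; a rough sketch, where `fetch_page()` and `first_page_url` are hypothetical stand-ins rather than anything in qualtRics:

```r
# Hypothetical sketch: pause briefly between paginated API calls.
# fetch_page() and first_page_url are placeholders for whatever makes
# one paginated request and its starting URL.
results <- list()
next_page <- first_page_url
while (!is.null(next_page)) {
  page <- fetch_page(next_page)              # one paginated request
  results <- c(results, list(page$elements))
  next_page <- page$nextPage                 # NULL once there are no more pages
  Sys.sleep(0.5)                             # space out the calls
}
```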
Ok, I have a lot more information and I hope better suggestions now.
As I stated above, 504 errors are timeouts per Qualtrics support. Their best suggestion for mitigating this is... drumroll... to submit smaller requests. In the case of `list_distribution_links()`, they specifically recommended breaking up the distributions. I personally think that's a non-starter. In my practical case, we already have large distributions and large surveys, so we could change our approach going forward, but I don't want to inconvenience the folks doing the research.
For `fetch_survey()` (where I've had some issues in the past), the start/end date parameters do allow one to limit what they're querying. The results also aren't paginated, so even if it fails, you just try it again. In the case of `list_distribution_links()` and `fetch_distribution_history()` (I'm not sure if I've had timeout errors as part of that; it's certainly less frequent), you must submit many paginated requests to retrieve everything for a large distribution. The bottom line: you can download 100k survey responses at once, but if you have 10k distributions you'll have to submit many requests, and if one of them times out, you lose everything in the current implementation.
It's not a rate-limit issue; those limits are really high.
What I think we should do:
1. `list_distribution_links()` works, but my first request goes to `{{orgdomain}}.xxx.qualtrics.com`. Subsequent requests, per the `nextPage` parameter, go to `{{datacenterid}}.qualtrics.com`. `xxx` was a value that I now don't remember where it came from; `{{datacenterid}}` can be looked up in account settings, per this article. There is a cost to not submitting to the datacenter: per Qualtrics support, the request has to get re-routed. Submitting to the datacenter directly might prevent timeout errors (and it could matter more for things like `fetch_survey()`, where there is only a single request). This would be a (breaking?) change to make across the entire package, and it requires the end user to go and look up a different value in order to use it.
2. Add retry handling for these errors to `list_distribution_links()`.
3. Include 504 errors in `qualtrics_response_codes()`.

I have been looking at revamping the whole package to httr2 because it has much nicer handling of rate limiting, retries, etc. Maybe this is the motivation we need to finally do it.
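For context, a hedged sketch of what httr2-style retry handling looks like; the endpoint, header, and the set of statuses treated as transient are illustrative assumptions:

```r
library(httr2)

# Illustrative only: httr2 attaches retry behavior to the request itself.
resp <- request("https://yourdatacenterid.qualtrics.com/API/v3/whoami") |>
  req_headers("X-API-TOKEN" = Sys.getenv("QUALTRICS_API_KEY")) |>
  req_retry(
    max_tries    = 4,
    is_transient = function(resp) resp_status(resp) %in% c(429, 500, 503, 504)
  ) |>
  req_perform()
```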
I'm not familiar with httr2. I have used `httr::RETRY()` with other API wrappers I've made. I hadn't thought about using it in this case; the more crude try/except was what I had set up prior to an intern showing me `httr::RETRY()`.
So, using `httr::RETRY()` could be a fairly simple way of handling this, I think, without refactoring everything right now.
@juliasilge what are you thinking for the short term? I have my process set up to run with my modified version of the code rather than the package, and I don't know if there's anyone else using this on distributions large enough for it to really matter.
I have a question @chrisumphlett on the first point, about the `base_url`. What we say right now is this:

> The base URL you pass to the qualtRics package should either look like `yourdatacenterid.qualtrics.com` or like `yourorganizationid.yourdatacenterid.qualtrics.com`, without a scheme such as `https://`. The Qualtrics API documentation explains how you can find your base URL.

And then folks can click through to this article to find their base URL. (Unfortunately, we can't link directly to that URL because of CRAN rules about redirects.) I believe almost everyone should be using a base URL like `{{datacenterid}}.qualtrics.com` already, and then we use that to generate URLs for API requests. What do you think will need to change here?
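For reference, the base URL is supplied once through the credentials helper; a small sketch (the API key and datacenter ID below are placeholders):

```r
library(qualtRics)

# Register credentials with a datacenter-style base URL (placeholders below).
qualtrics_api_credentials(
  api_key  = "<YOUR-QUALTRICS-API-KEY>",
  base_url = "yourdatacenterid.qualtrics.com",  # no https:// scheme
  install  = TRUE
)
```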
On your second point, are you saying you want to automatically retry behind the scenes for a 504 error a couple of times? Let's implement that for the most important function, see how it goes, and then extend it later to more functions.
We definitely should include 504 errors in `qualtrics_response_codes()`. 👍
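As a hedged illustration of the kind of branch that could be added (the message wording and structure are assumptions, not the package's actual `qualtrics_response_codes()` code):

```r
# Illustrative only: an extra branch for 504 in a status-code check.
check_for_504 <- function(res) {
  if (httr::status_code(res) == 504) {
    stop(
      "Qualtrics API reported a gateway timeout (504). This usually means the ",
      "request was too large or took too long; try a smaller request or retry.",
      call. = FALSE
    )
  }
  invisible(res)
}
```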
Regarding the URL: for my organization, the "xxx" in `orgid.xxx.qualtrics.com` is not the same as the datacenterid in `yourdatacenterid.qualtrics.com` (per the instructions at the Qualtrics link). Maybe the issue is more with the instructions/documentation than with the way the URL is constructed. Either way it's a minor issue; it works as-is, but it's not what Qualtrics recommends.
On the 2nd point -- yes, retry on errors. Perhaps not limited to 504; maybe limited to 50X.
@juliasilge I made a simple change to `utils.R`: I changed `httr::VERB()` to `httr::RETRY()` in the `qualtrics_api_request()` function. This puts the error-handling logic on all calls. The image below shows it working for both the `fetch_distribution_history()` and `list_distribution_links()` calls.
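Roughly, the change amounts to swapping the verb call for a retrying one; a simplified sketch (the header construction and arguments here are assumptions, and the real `qualtrics_api_request()` has more arguments and response handling):

```r
# Simplified sketch of the VERB -> RETRY swap inside the request helper.
qualtrics_api_request <- function(verb = "GET", url, body = NULL) {
  headers <- httr::add_headers("X-API-TOKEN" = Sys.getenv("QUALTRICS_API_KEY"))

  # Previously: res <- httr::VERB(verb, url = url, headers, body = body)
  res <- httr::RETRY(
    verb,
    url   = url,
    headers,
    body  = body,
    times = 4,     # maximum number of requests attempted, including the first
    quiet = FALSE  # message when a retry happens
  )
  res
}
```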
The 2nd call was 267 pages. There were 11 total failures; on one occasion it was 2 in a row. So the default `times = 3` was sufficient, though it would probably be good to raise that to at least 4. (I'm not sure if that means 3 retries or 3 total tries; I'd want at least 3 retries.)
Oh, that seems pretty great @chrisumphlett! I think let's give that a go.
This is now addressed and we can close it, correct @chrisumphlett?
yes!
Ugh
nevermind. I think I needed to restart my session.
I do think we should add something to the documentation and/or console messaging, will open a different issue for that.
In my experience, getting a 500 error from Qualtrics is a normal, intermittent issue. When using a function that will need to iterate many times to get its results (a lot of survey responses from `fetch_survey` or many contacts from `fetch_distribution_history`), if it gets that error one time, you have to start over. In some cases I have never been able to see the function finish.

I opened a ticket with Qualtrics, who gave this explanation:
The easiest and best thing would probably be to have a single error-handling function and then wrap it around each call to the Qualtrics API. Below is an example of how I've done this. I don't think this is exactly the right solution, though. For one, this is wrapped around the `fetch_survey` function, not the call inside the function. This works for me because I'm looping through many surveys, and if one survey fails, I can just redo the fetching for the entire thing; it's not a big deal because I'm never pulling a lot of results from any one survey.

I'm also not 100% sure that this is working when used to wrap the call. I used it on @dsen6644's developmental `list_distribution_links` function. It works to get me through, but I don't get exactly the same number of results. I think it's ultimately skipping a page rather than retrying it, and/or getting the same page twice, with the way I implemented it (see below). When it would fail, I wouldn't get a second message for that iteration with "attempt 2 of 4", and I'm not sure why. In the end one df had 21,441 rows and the other had 21,444. I was expecting 21,086. After removing duplicates from each, both dfs had 20,946 rows.
I'm far from an expert on error-handling, so I offer this as a potential starting place; there may be some other method that is much better for use within the package.
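As a hedged sketch of the wrapping approach described above (the wrapper name, attempt count, and pause are illustrative assumptions, not the original example):

```r
library(qualtRics)

# Illustrative retry wrapper around an entire qualtRics call.
# The wrapper name, attempt count, and pause length are assumptions.
with_retries <- function(fun, attempts = 4, pause = 5) {
  for (i in seq_len(attempts)) {
    message("attempt ", i, " of ", attempts)
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    Sys.sleep(pause)  # brief pause before trying again
  }
  stop("All ", attempts, " attempts failed: ", conditionMessage(result))
}

# Usage: wrap the whole fetch; the survey ID is a placeholder.
responses <- with_retries(function() {
  fetch_survey(surveyID = "SV_xxxxxxxxxxxxxxx")
})
```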