mattroumaya / surveymonkey

Access your SurveyMonkey data directly from R!
https://mattroumaya.github.io/surveymonkey/

Error: Can't subset columns that don't exist. x Column `survey_id` doesn't exist. #73

Closed: sofiavilchis closed this issue 2 years ago

sofiavilchis commented 3 years ago

Hi, I have been pulling my surveys for months, and nothing has changed in the survey, but yesterday when I tried parse_survey it gave me this error: "Error: Can't subset columns that don't exist. x Column survey_id doesn't exist."
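For context, here is a rough sketch of what I'm running (the survey ID and token below are placeholders; the token is set via options() as in the package README):

library(surveymonkey)
options(sm_oauth_token = "YOUR_TOKEN_HERE")   # placeholder token
surv_obj <- fetch_survey_obj(1234567)         # placeholder survey ID
dat <- parse_survey(surv_obj)                 # this is the call that now errors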

sfirke commented 3 years ago

Hrm, I can't think what this would be - the package hasn't changed and it sounds like your survey hasn't either. I'll leave the issue open in case it affects others. Does browse_surveys() still work?

ForwardDataWorks commented 2 years ago

Getting a similar error when attempting parse_survey. (browse_surveys works, and fetch_survey_obj works)

New names:
* weight -> weight...4
* weight -> weight...5
* weight -> weight...6
* weight -> weight...7
* weight -> weight...8
* ...
Error: Can't subset columns that don't exist.
x Columns `choice_metadata`, `date_created`, `date_modified`, `ip_address`, and `GoldSlayerID` don't exist.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Outer names are only allowed for unnamed scalar atomic inputs

I have been using this twice a week for about 6 months, and I most recently used it last Thursday. We are using these surveys for Customer Support, so we have about 20 of them that are identical except for the Support Agent's name. I tried a few random survey IDs, and I get the same error.

sfirke commented 2 years ago

Given that the package is unchanged and your survey is unchanged, I suspect you may have installed a newer version of a tidyverse package that this package depends on, and that newer version contains a breaking change affecting something internal to parse_survey. Might that be the case?
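For example, you could check what you have installed now against what you had last week (these are just the likely suspects, not a definitive list):

# check currently installed versions of packages parse_survey() leans on
packageVersion("dplyr")
packageVersion("tidyr")
packageVersion("tibble")
packageVersion("rlang")
packageVersion("surveymonkey")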

ForwardDataWorks commented 2 years ago

@sfirke thanks for the quick reply. As I mentioned in my original post, I have been using the surveymonkey package twice a week for the past 6 months (on the same group of surveys), and I used it most recently on Thursday of last week. I didn't update any packages in the last four days. Any other suggestions you have would be helpful. I am proficient in R, but I am not a developer.

sfirke commented 2 years ago

If your survey is the same, and all packages are unchanged including surveymonkey, it's hard to think what the culprit might be. It's possible that SurveyMonkey made a change to their API and is exporting differently, but the documentation looks unchanged and that seems unlikely to me.

I no longer have a SurveyMonkey account and have stepped away from this project, so I can't fully debug this myself even if I wanted to; maybe @mattroumaya has an idea. In the meantime, you could try the error message's suggestion to run rlang::last_error() and see if that trace tells you anything useful.
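That is, immediately after the failing call:

rlang::last_error()   # shows where the error was signalled
rlang::last_trace()   # full backtrace, usually more informative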

Sorry, I know that's a bummer. To assign blame for this: at my prior company I asked SurveyMonkey to compensate us in-kind for writing and maintaining this package, but their idea of compensation was inadequate. My company switched to a competitor product as a result and abandoned this package. If SurveyMonkey had invested in open source, my former colleagues would still be maintaining it and they could likely troubleshoot this. The package would probably even be on CRAN and have unit tests written that might surface breakage as a result of changes.

For you or anyone who wants to tackle this, no developer experience needed (just R fluency):

The problem is in https://github.com/tntp/surveymonkey/blob/master/R/assemble.R. Whoever wants to dig into this, start with surv_obj <- fetch_survey_obj(1234567) [with your own survey ID] to load that object into the environment, and then run the code in that linked file for parse_survey() line by line, starting with row 17 where it calls get_responses. You'll find precisely which line causes the error, and you can inspect the objects at that point to see what's going wrong.
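Roughly, that looks like this (1234567 is a placeholder survey ID; the ::: calls are my guess at which helpers are internal, based on the package source):

# step through the start of parse_survey() by hand
surv_obj <- fetch_survey_obj(1234567)                             # your survey ID
respondents <- get_responses(surv_obj$id)                         # first call inside parse_survey()
responses <- surveymonkey:::parse_respondent_list(respondents)
question_combos <- surveymonkey:::parse_all_questions(surv_obj)
x <- dplyr::full_join(question_combos, responses)                 # the join parse_survey() does next
names(x)                                                          # check which columns actually came back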

mattroumaya commented 2 years ago

Thanks for the additional info, @sfirke. I actually tried to debug this earlier, but couldn't replicate the issue. I even updated tidyverse and other packages to see if it would throw an error across multiple surveys, but had no issues.

@ForwardDataWorks - can you provide any information about the structure of your survey? Any detail about the types of questions you're using, whether there are custom variables, or anything else unique would be great.

Not that you haven't already done so, but I feel obligated to suggest restarting your R session and trying to re-run.

Also, it's very lame that SurveyMonkey was not willing to adequately compensate for open-source development. This package greatly benefits me and my colleagues (also enterprise customers), and without it we would be more inclined to look elsewhere. I'm happy to help maintain it and add functionality when possible/warranted.

mattroumaya commented 2 years ago

@ForwardDataWorks, I spoke too soon and did find an instance where I also got the same error. It looks like for some question types, there is an additional field called choice_metadata that is leaking through the API.

Try the solution below (defined as parse_survey2) and let me know if it works for you. If so, I'll submit a PR to fix it in the package.

parse_survey2 <- function(surv_obj, oauth_token = getOption('sm_oauth_token'), ...){
  if(surv_obj$response_count == 0){
    warning("No responses were returned for this survey.  Has anyone responded yet?")
    return(data.frame(survey_id = as.numeric(surv_obj$id)))
  }

  respondents <- get_responses(surv_obj$id, oauth_token = oauth_token, ...)

  # Save response status to join later
  vals <- c("id", "response_status")
  response_status_list <- lapply(respondents, "[", vals)
  status <- do.call(rbind.data.frame, response_status_list)

  responses <- respondents %>%
    surveymonkey:::parse_respondent_list()

  question_combos <- surveymonkey:::parse_all_questions(surv_obj)

  # this join order matters - putting q_combos on left yields the right ordering of columns in final result
  # the joining variables vary depending on question types present, so can't hard-code. Thus squash message
  x <- suppressMessages(dplyr::full_join(question_combos, responses))

  # There should not be duplicate rows here, but putting this here in case of oddities like #27
  assertthat::assert_that(sum(duplicated(dplyr::select_if(x, is.atomic))) == 0,
                          msg = paste0("There are duplicated rows in the responses, maybe a situation like #27 - ", surveymonkey:::file_bug_report_msg()))

  # questions with only simple answer types might not have some referenced columns, #46
  add_if_not_present <- c(choice_id = NA_character_, choice_position = NA_integer_)
  x <- x %>%
    tibble::add_column(!!!add_if_not_present[!names(add_if_not_present) %in% names(.)])

  # 'type' and 'required' are created when question_type == 'demographic'
  # Drop them because it causes issues with duplicated rows per respondent_id
  # Reference Issue #27, Issue #62
  x$type <- NULL
  x$required <- NULL
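  # choice_metadata is an additional field the API now returns for some
  # question types (weighted questions, per this issue); drop it as well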
  x$choice_metadata <- NULL

  #If question type = Multiple Choice, include choice text + ID in the combined new columns

  x$q_unique_id <- apply(
    x %>%
      dplyr::select(question_id, row_id, col_id, other_id),
    1,
    function(x) paste(stats::na.omit(x), collapse="_")
  )
  x$q_unique_id[x$question_type == "multiple_choice" | x$question_subtype == "multi" & is.na(x$other_id)] <- paste(
    x$q_unique_id[x$question_type == "multiple_choice" | x$question_subtype == "multi" & is.na(x$other_id)],
    x$choice_id[x$question_type == "multiple_choice" | x$question_subtype == "multi" & is.na(x$other_id)],
    sep = "_")

  x$combined_q_heading <- apply(
    x %>%
      dplyr::select(heading, row_text, col_text, other_text),
    1,
    function(x) paste(stats::na.omit(x), collapse= " - ")
  )

  x <- x %>%
    dplyr::mutate(combined_q_heading = dplyr::case_when(question_type == "multiple_choice" & is.na(other_text) ~ paste(combined_q_heading, choice_text, sep = " - "),
                                                        question_type != "open_ended" & question_subtype == "multi" & is.na(other_text) ~ paste(combined_q_heading, choice_text, sep = " - "),
                                                        TRUE ~ combined_q_heading))

  # combine open-response text and choice text into a single field to populate the eventual table
  x$answer <- dplyr::coalesce(x$response_text, x$choice_text)
  assertthat::assert_that(sum(!is.na(x$answer)) == (sum(!is.na(x$response_text)) + sum(!is.na(x$choice_text))),
                          msg = paste0("Uh oh, we failed to account for a combination of open-response text - ", surveymonkey:::file_bug_report_msg()))
  static_vars <- setdiff(names(x), c("heading", "question_id", "question_type", "question_subtype",
                                     "choice_position", "choice_text", "quiz_options", "choice_id",
                                     "other_id", "other_text", "row_text", "row_id", "description",
                                     "col_text", "response_text", "col_id", "q_unique_id",
                                     "combined_q_heading", "answer"))

  final_x <- x %>%
    dplyr::select(tidyselect::all_of(static_vars), combined_q_heading, answer, q_unique_id)

  qid_text_crosswalk <- final_x %>%
    dplyr::distinct(q_unique_id, combined_q_heading) %>%
    dplyr::mutate(unique_text = de_duplicate_names(combined_q_heading))

  # did a full_join above to make sure that all questions [q_unique_ids] are present in result even if no one answered them
  # but that means the spread will fail b/c there's more than one response per q_unique_id for response_id == NA
  # Adjust for that to spread, then filter that out after spread
  final_x_real <- final_x %>%
    dplyr::filter(!is.na(response_id))

  final_x_dummy <- final_x %>%
    dplyr::filter(is.na(response_id)) %>%
    dplyr::distinct(q_unique_id)

  final_x <- dplyr::bind_rows(final_x_real, final_x_dummy)

  # spread wide
  # get column order to reset to after spread makes alphabetical
  col_names <- c(names(final_x)[!(names(final_x) %in% c("combined_q_heading","answer", "q_unique_id"))], qid_text_crosswalk$unique_text)

  out <- final_x %>%
    dplyr::select(-combined_q_heading) %>%
    dplyr::mutate(q_unique_id = factor(q_unique_id, levels = qid_text_crosswalk$q_unique_id)) %>% # to spread unrepresented levels
    tidyr::pivot_wider(names_from = q_unique_id, values_from = answer) %>%
    dplyr::filter(!is.na(response_id))

  # Takes spread-out results data.frame and turns multiple choice cols into factors.  GH issue #12
  # Doing this within the main function so it can see crosswalk

  master_qs <- x %>%
    dplyr::distinct(q_unique_id, choice_id, question_id, choice_position, choice_text)

  # set a vector as a factor, if it has answer choices associated with its question id
  set_factor_levels <- function(vec, q_id){

    # fetch possible answer choices given a question's text
    get_factor_levels <- function(q_id){
      master_qs %>%
        dplyr::filter(q_unique_id == q_id, !is.na(choice_id)) %>%
        dplyr::arrange(choice_position) %>% # appears to always come from API in order but don't want to assume
        dplyr::pull(choice_text) %>%
        unique() # in case they loaded the same value twice as answer choices, #48
    }

    name_set <- get_factor_levels(q_id)
    if(length(name_set) == 0){
      return(vec)
    } else {
      factor(vec, levels = name_set)
    }
  }
  out <- purrr::map2_dfc(out, names(out), set_factor_levels)

  # reset to text names instead of numbers
  # and then re-order to correct columns
  names(out)[(length(static_vars) + 1):length(names(out))] <- qid_text_crosswalk$unique_text[match(names(out)[(length(static_vars) + 1):length(names(out))],qid_text_crosswalk$q_unique_id)]
  out <- out[, col_names]
  out <- out %>%
    dplyr::arrange(dplyr::desc(response_id)) %>%
    dplyr::rename(respondent_id = response_id)

  # Join response status
  out <- out %>%
    dplyr::left_join(.,status, by = c("respondent_id" = "id")) %>%
    dplyr::select(survey_id, collector_id, respondent_id, date_created, date_modified, response_status, everything())
  out
}

# Helper function for de-duplicating identical Q names
# Input: the vector of names
# Adapted from janitor::make_clean_names()
de_duplicate_names <- function(x){
  dupe_count <- vapply(seq_along(x), function(i) {
    sum(x[i] == x[1:i])
  }, integer(1))
  x[dupe_count > 1] <- paste(x[dupe_count >
                                 1], dupe_count[dupe_count > 1], sep = "_")
  x
}
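For reference, usage should be the same as parse_survey() (placeholder survey ID; assumes your OAuth token is already set via options(sm_oauth_token = ...)):

# after running the definitions above in your session
surv_obj <- fetch_survey_obj(1234567)   # placeholder survey ID
dat <- parse_survey2(surv_obj)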

ForwardDataWorks commented 2 years ago

@mattroumaya Thanks for the code. Unfortunately it is throwing an error:

Error in parse_respondent_list(.) : 
  could not find function "parse_respondent_list"

Any suggestions/edits you can provide would be greatly appreciated.

mattroumaya commented 2 years ago

@ForwardDataWorks sorry about that! I updated the code above; if you copy and paste it again and use parse_survey2, I'm hoping it'll work.

ForwardDataWorks commented 2 years ago

Ah, brilliant, @mattroumaya !! Thanks for the help.

mattroumaya commented 2 years ago

Glad it worked for you! I'll submit a PR later to close this out; once it's merged, you'll be able to update the package and hopefully won't run into any more issues. I'll also investigate further, because it seems that choice_metadata is being brought in for weighted questions, so we'll need to decide how that gets handled.

mattroumaya commented 2 years ago

PR #76 closes this; you can update the package and should no longer run into this issue. It seems that SurveyMonkey has been doing a bit of work on their API, so we'll have to keep an eye out for any similar issues that come up.
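For example, reinstalling the development version from GitHub should pick up the fix (repo path taken from the project page above):

# reinstall the development version, then restart R and re-run parse_survey()
remotes::install_github("mattroumaya/surveymonkey")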

sfirke commented 2 years ago

Nice work!