ropensci / qualtRics

Download ⬇️ Qualtrics survey data directly into R!
https://docs.ropensci.org/qualtRics

column_map, metadata, fetch_survey exported qnames discrepancy for Matrix Questions #144

Closed: shaun-jacks closed this issue 3 years ago

shaun-jacks commented 4 years ago

I'm not sure if this is an issue with the library or the Qualtrics API itself, but I noticed that when I export column_map questions and compare them to the question names after exporting with fetch_survey (with import_id = FALSE), this discrepancy occurs:

For example:
fetch_survey question name: Q6.63_68
column_map name: Q.63_x68

Metadata for the question: it's a Matrix Multiple Choice question with a Likert selector. The column names exported by column_map have an 'x' in them, whereas the fetch_survey question names do not.

Also, within the metadata function there is an x prefix for the subquestions.

I haven't double-checked, but possible causes are that column_map is calling v2 instead of v3 of the API, or that the API itself is returning this discrepancy.

qualtRics_3.1.2

shaun-jacks commented 4 years ago

Link for metadata API https://api.qualtrics.com/reference#get-survey

I noticed column_map calls this API to get a column map, and that endpoint belongs to the legacy surveys API. So it makes sense that it's mapping the old version of the question names as opposed to v3.

shaun-jacks commented 4 years ago

Possible solution for metadata: link for the updated metadata API: https://api.qualtrics.com/reference#get-survey-1

Deprecate metadata and instead call the API at https://api.qualtrics.com/reference#get-survey-1: a GET to https://<base url>/API/v3/survey-definitions/<survey-id>, retrieving the {"result": {"Questions": {...}}} portion of the response, gives similar metadata as well.
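For reference, here's a minimal sketch of that call using httr directly (my own illustration, not package code; it assumes the QUALTRICS_API_KEY and QUALTRICS_BASE_URL environment variables that qualtRics uses hold your API token and data-center hostname):

library(httr)

get_survey_definition <- function(survey_id) {
  # GET /API/v3/survey-definitions/{surveyId}, per the doc linked above
  url <- paste0("https://", Sys.getenv("QUALTRICS_BASE_URL"),
                "/API/v3/survey-definitions/", survey_id)
  res <- GET(url, add_headers(`X-API-TOKEN` = Sys.getenv("QUALTRICS_API_KEY")))
  stop_for_status(res)
  # The question metadata lives under result$Questions in the parsed JSON
  content(res)$result$Questions
}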

I currently can't find a solution for updating column_map, though, other than constructing one yourself from the new question names.

dsen6644 commented 4 years ago

I believe this is an issue with the Qualtrics API, not the package itself, and it seems to have started recently. I've had issues with fetch_survey, as well as with general exports from the Qualtrics platform, returning truncated column names for matrix and constant sum questions. Like your example, my issues typically occur when the choice export tags contain character elements.

It seems that we need to reconstruct the export tag (i.e., the column name) by combining the DataExportTag and ChoiceDataExportTags fields for matrix-style (and possibly constant-sum-style) questions.

Here is my first attempt at working through this issue; it needs a few more tests and some cleanup before I submit a PR.

survey_question <- function(x){

  question_id <- x$QuestionID
  question_text <- x$QuestionDescription
  data_export_tag <- x$DataExportTag
  data_type <- x$QuestionType
  selector <- x$Selector

  df <- tibble::tibble(question_id = question_id,
                       export_tag = data_export_tag,
                       data_type = data_type,
                       question_text = question_text,
                       selector = selector)

  if(data_type == "Matrix" || data_type == "CS"){

    # Subquestion display text, one entry per row/choice
    subquestion <- purrr::map_chr(x$Choices, "Display", .default = NA_character_)

    if(data_type == "Matrix") subquestion_export_tags <- purrr::map_chr(x$ChoiceDataExportTags, 1L, .default = NA_character_) # this needs to be changed to: if not FALSE then...
    if(data_type == "CS") subquestion_export_tags <- names(x$Choices)

    # Only join with "_" when both parts are non-empty
    separator <- ifelse(data_export_tag == "" | rlang::is_empty(subquestion_export_tags), "", "_")
    subquestion_export_tags <- paste(data_export_tag, subquestion_export_tags, sep = separator)

    df_subquestion <- tibble::tibble(question_id = question_id,
                                     subquestion = subquestion,
                                     subquestion_export_tag = subquestion_export_tags)

    df <- dplyr::left_join(df, df_subquestion, by = "question_id")

    # Prefer the reconstructed subquestion tag where one exists
    df$export_tag <- ifelse(is.na(df$subquestion_export_tag), df$export_tag, df$subquestion_export_tag)

    df <- df[, names(df) != "subquestion_export_tag"]

  }

  return(df)

}

survey_questions <- function(surveyid){

  # qualtrics_api_request() is qualtRics' internal authenticated request helper
  url <- paste0(Sys.getenv("QUALTRICS_BASE_URL"), "/API/v3/survey-definitions/", surveyid)
  res <- qualtrics_api_request("GET", url = url)
  questions <- res$result$Questions
  df <- purrr::map_df(questions, survey_question)
  return(df)

}
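Hypothetical usage (survey ID invented; note this relies on qualtRics' internal qualtrics_api_request(), so it would need to run from a fork of the package or via qualtRics:::):

qs <- survey_questions("SV_0abcdefghijklmn")
dplyr::filter(qs, data_type == "Matrix")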
jmobrien commented 4 years ago

Came here to comment something similar. I just went back to do some tests. Here are some facts about this issue:

  1. Relates to an option in the "recode" menu for a question called "question export tags" where the specific variable names for rows in a matrix response question can be assigned.

  2. "Question export tags" can be checked or unchecked; however, if unchecked any underlying assigned export names, if present, will remain resident.

  3. There has been a history of Qualtrics sometimes invisibly re-arranging numeric recodes when response items get added or removed (this is actually the main reason I'm using the API and these tools: so I can catch it). It appears from a survey I was just reviewing that this can happen for question export tags associated with rows of matrix questions as well.

  4. fetch_survey using the PREVIOUS version of the API would always use these underlying name recodings, whether or not "Question export tags" was currently checked on the web interface.

  5. Similarly, the API response data used for column_map reports the same way as the older fetch_survey API (understandable, as they were paired).

  6. Variable names for the NEW fetch_survey depend on whether that "question export tags" box is presently checked in that (somewhat hidden) web interface menu.
    a. If not checked, it will just append numbers indicating the order in which they were shown on screen.
    b. If checked, it will use the renaming tags.

  7. The NEW "get survey" API (not yet implemented in qualtRics) presents this strange notion that, underneath everything, there is a separation between "Choices" and "ChoiceOrder" that dictates display and, ultimately, the variable names for what is downloaded. Here's what I'm seeing from my example:

"Choices": {
    "1": {...}
    "2": {...}
    "3": {...}
    "4": {...}
    "5": {...}
    "6": {...}
    "9": {...}
    "10": {...}
}
"ChoiceOrder": [
    0: "6"
    1: "9"
    2: "1"
    3: "10"
    4: "2"
    5: "3"
    6: "4"
    7: "5"
]

// if "question export tags" in web interface IS checked:
"ChoiceDataExportTags": {
    "1": "[itemvarname]_1"
    "2": "[itemvarname]_2"
    "3": "[itemvarname]_3"
    "4": "[itemvarname]_4"
    "5": "[itemvarname]_5"
    "6": "[itemvarname]_6"
    "9": "[itemvarname]_9"
    "10": "[itemvarname]_10"
}
// And the variables you get from fetch_survey() are:
// [itemvarname]_[6,9,1,10,2,3,4,5]_[specific answers]
// note that data is coming out as:
//    Choice[ChoiceOrder], named ChoiceDataExportTags[ChoiceOrder]

// if question export tags is UNchecked:
"ChoiceDataExportTags": false

// And the variables you get from fetch_survey() then are:
// [itemvarname]_[1:8]_[specific answers]
// data is still depending on this underlying "ordering" idea:
//    Choice[ChoiceOrder], named paste0([itemvarnames], "_", ChoiceOrder + 1)

The underlying idea that there's a "pure" order of choices separate from what is displayed, with a separate "choice order" that is what's ACTUALLY displayed, really explains a lot of the strange behavior we've wrestled with to date. That same set of elements exists for, say, the radio buttons of the likert-style scale used in most of our questions, and the ordering is off on the ones I already knew had problems. But I have absolutely zero sense of why this exists or how one might observe, access, or change it from the web interface. It mostly feels like a source of problems.

(Worth noting that w/matrix questions, there are also elements "Answers" and "AnswerOrder", so presumably that final suffix might also face these sorts of issues if it somehow got internally rearranged.)
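To make that naming rule concrete, here's a small R sketch (my own illustration; the choice keys and ordering are taken from the example above, and itemvarname stands in for the question's export tag):

choices      <- c("1", "2", "3", "4", "5", "6", "9", "10")  # names(Choices)
choice_order <- c("6", "9", "1", "10", "2", "3", "4", "5")  # ChoiceOrder
item <- "itemvarname"

# If "question export tags" IS checked: names come from ChoiceDataExportTags,
# indexed by ChoiceOrder
tags <- setNames(paste0(item, "_", choices), choices)
unname(tags[choice_order])
#> "itemvarname_6" "itemvarname_9" "itemvarname_1" "itemvarname_10" ...

# If UNchecked: names are just the item tag plus display position
paste0(item, "_", seq_along(choice_order))
#> "itemvarname_1" "itemvarname_2" ... "itemvarname_8"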

So, summarizing:

  1. The new API driving fetch_survey and the older API behind column_map will only be guaranteed to match up if, whenever there is an export name recoding option in the recode menu of the web interface, that option is checked.
  2. If not, there might be an underlying hidden renaming, possibly related to this weird invisible "choice ordering", that can lead to discrepancies.
  3. The new "get survey" API would match the data from the new fetch_survey, BUT I don't think it has a comprehensive question export map like the one usefully found in the older API. I don't immediately know how you would construct one from it; I think others have discussed this concern on the Qualtrics forums.

PS: worth noting that the web-downloaded CSVs don't match up with any of the above, using either the legacy format (outputs based on the underlying "pure" ordering of choices, renamed with the tags) or the new format (which doesn't separate out variables if there is the possibility of multiple responses per row). Qualtrics, what were you thinking.

jmobrien commented 4 years ago

Okay, following up w/more useful info:

First, I saw breakout_sets was added to fetch_survey, so that covers the difference b/w the modern API and web versions.

Second, there's a new place for column mappings: it's attached to response downloads themselves! It's all extra JSON in one row. Makes sense, actually, as the columns for any given download could vary depending on what is asked for.

I needed this, so I added some functionality for myself, and I'm making a pull request now. I set up a new fetch_survey parameter, colmap_attrs, that, if TRUE, will add the content needed to construct a column map as attributes on each variable. A helper function in utils.R called get_colmap() provides this. Presumably, it could be reworked to access downloaded response data in a way that outputs a column map df more directly, similar to what column_map() does right now.

In the meantime, I also made a helper function extract_colmap that pulls it from the new dataframe of responses, just for my purposes. It was added separately in case people want to look at/play with it, but it's left unexported for now.
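For a sense of the idea, here's a rough sketch of pulling such per-column attributes back into a data frame (not the actual package code; the attribute names are guesses based on the extract_colmap() output shown later in this thread):

library(purrr)
library(tibble)

# Return a character attribute, or NA if the column doesn't carry it
attr_or_na <- function(x, which) {
  a <- attr(x, which, exact = TRUE)
  if (is.null(a)) NA_character_ else as.character(a)
}

colmap_from_attrs <- function(df) {
  imap_dfr(df, function(col, nm) {
    tibble(
      qname       = nm,
      description = attr_or_na(col, "description"),
      ImportId    = attr_or_na(col, "ImportId"),
      choiceId    = attr_or_na(col, "choiceId")
    )
  })
}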

Incidentally, I also fixed the import_ids param in read_survey, which wasn't working because it was expecting the old JSON format, and added some assertions for breakout_sets. Those should work now.

shaun-jacks commented 4 years ago

Thanks for investigating this, @jmobrien! This would be really helpful, as I have been relying on a column map as well to merge question metadata with the question names. My current hacky and very inefficient fix has been to call fetch_survey once with import_id = TRUE and once with import_id = FALSE, and construct a map from there. So this would be great.
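For anyone needing a stopgap, here's a rough sketch of that workaround (survey ID invented; it pairs columns positionally, so it gets shaky once readr de-duplicates repeated ImportIds, as discussed below):

ids    <- fetch_survey("SV_0abcdefghijklmn", import_id = TRUE)
labels <- fetch_survey("SV_0abcdefghijklmn", import_id = FALSE)

# Both calls return the same columns in the same order, so pair up the names
colmap_hack <- tibble::tibble(
  qname     = names(labels),
  import_id = names(ids)
)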

jmobrien commented 4 years ago

@shaun-jacks still not sure if this addresses your issues, where you were seeing those additional x's in variable names, e.g. Q.63_x68. Were you able to confirm that we're looking at the same thing?

For instance, if we are looking at the same thing, then going into the recode menu for your question Q.63 and checking "Question export tags", then downloading with the current fetch_survey, should result in variable names that match what you're getting from column_map. If not, we have other unaddressed issues.

jmobrien commented 4 years ago

Just looking ahead: it seems to me the lack of a good column map was probably the main reason not to work toward migrating to the modern "get survey" endpoint (at https://api.qualtrics.com/reference#get-survey-1).

Don't know how much work that would be, but it has a number of nice new things, like actually including survey flow logic. (QualtricsTools, over at github.com/emma-morgan/QualtricsTools, already worked out some flow mapping, albeit w/downloaded qsf's rather than the JSON format from the API.)

And of course it would more broadly eliminate the variable naming discrepancies.

shaun-jacks commented 4 years ago

@jmobrien I can confirm that you are correct. Toggling "Question export tags" brings back the old naming scheme that matches what I got from column_map.

jmobrien commented 4 years ago

@shaun-jacks okay great. I guess we've covered it, then.

Also, I just added to the pull req. some bug fixes for how import_ids worked, which was creating a bunch of duplicated names with semi-random appends (since duplicate ImportIds are possible w/breakout_sets = TRUE). It now appends the choiceId as well.

pschatz25 commented 3 years ago

I think I'm running into this issue. I have a survey with multiple types of questions. My goal is to link question metadata to the survey results so I can handle questions differently based on their type. All of the questions in the survey, including matrix table subQuestions, have custom question export tags.

Using fetch_survey() with import_id = FALSE gives me the results with the custom export tags. Using metadata() and looking at the question data, I'm not seeing the same custom import_ids for matrix table subQuestions (i.e., subQuestions.#.recode). Instead it seems to be pulling the internal Qualtrics ID number. I think that's because the recode values are not sequential, which happens when you add an item and then delete it and Qualtrics doesn't re-number the items.

Using fetch_survey() with import_id = TRUE gives me the results with each subQuestion column getting the same QID and R renaming the duplicate columns with sequential numbers. I can match the QID to the question metadata, and the sequential renaming of columns will sometimes match up with the subQuestion.#.recode value, but not always, such as when an item has been deleted and the subQuestion recode values are not sequential.

Being able to link the fetch_survey() results to metadata() would be a huge help. In cases where custom question export tags are used, it does not currently seem possible. Either the subQuestion.#.recode values need to pull the custom tag rather than the internal Qualtrics ID, or the fetch_survey() import_id needs to include the appropriate subQuestion ID for matrix table subQuestions.

jmobrien commented 3 years ago

Some thoughts on this, also #185:

> I think I'm running into this issue. I have a survey with multiple types of questions. My goal is to link question metadata to the survey results so I can handle questions differently based on their type. All of the questions in the survey, including matrix table subQuestions, have custom question export tags.

Custom export tags are smart for a user, but for this package I'd imagine you'd want to assume they aren't there (either way it's the same field).

> Using fetch_survey() with import_id = FALSE gives me the results with the custom export tags. Using metadata() and looking at the question data, I'm not seeing the same custom import_ids for matrix table subQuestions (i.e., subQuestions.#.recode). Instead it seems to be pulling the internal Qualtrics ID number. I think that's because the recode values are not sequential, which happens when you add an item and then delete it and Qualtrics doesn't re-number the items.

Likely so; also, there may be API version mismatch issues here. I do know an updated metadata() should work (I have a test version that runs; I just haven't had time to finish it, as a basic switch would break several other helper functions in the package).

> Using fetch_survey() with import_id = TRUE gives me the results with each subQuestion column getting the same QID and R renaming the duplicate columns with sequential numbers.

Yes, as you mentioned, this is R's doing, via readr, so these won't necessarily match up. Something to think about.

> I can match the QID to the question metadata, and the sequential renaming of columns will sometimes match up with the subQuestion.#.recode value, but not always, such as when an item has been deleted and the subQuestion recode values are not sequential.

> Being able to link the fetch_survey() results to metadata() would be a huge help. In cases where custom question export tags are used, it does not currently seem possible. Either the subQuestion.#.recode values need to pull the custom tag rather than the internal Qualtrics ID, or the fetch_survey() import_id needs to include the appropriate subQuestion ID for matrix table subQuestions.

The column map that comes with the current response-data endpoint used by fetch_survey(), plus an updated metadata(), can in theory provide these links. But it turns out to be surprisingly complex--the column map uses an odd format that requires some knowledge and processing to turn into something that can be mapped to elements of metadata(), and on the metadata() side there are further considerations about which parts within a single "question" (i.e., QID) should be linked in. I know from my own work that a general solution is at least possible, but it's nontrivial to implement.
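As a toy illustration of that processing, linking column-map rows back to per-question metadata means first recovering the base QID from each ImportId (column names here follow the extract_colmap() output shown later in this thread; question_meta is a hypothetical one-row-per-QID table):

library(dplyr)
library(stringr)

colmap <- extract_colmap(res)

colmap %>%
  # ImportIds look like "QID1_1"; strip the suffix to get the question-level QID
  mutate(qid = str_extract(ImportId, "^QID[0-9]+")) %>%
  left_join(question_meta, by = "qid")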

Really, IMO, the first priority might be just getting things over to the new metadata(), which would at least put users on a more consistent footing. I'm not sure, though.

jmobrien commented 3 years ago

@juliasilge you've raised questions with me about package scope before, and my experience this year suggests that's really relevant here. Is the aim of qualtRics to be mainly a focused tool for accessing (relatively) raw data from key API endpoints? Or is the goal to provide a more comprehensive processing solution?

(Or, perhaps a goal of both, but in two separate packages, this one, and something like what QualtricsTools was trying to be?)

juliasilge commented 3 years ago

Thanks for your patience, all! 🙏

It is surprising how complex this question turns out to be. It seems like a good next step is to update metadata(); if any of you want to take a look at the work in the new PR #191 and see what it may or may not solve, that would be helpful. The endpoint being added there is newer and may be more helpful. Perhaps it should eventually become the default, after a deprecation process?

In terms of the question on scope, I don't see this package taking on the summarization-type tasks that QualtricsTools worked on, but some processing so that the data is usable within R (for example, by common data manipulation packages like dplyr) is within scope. For example, I'd like the main output here to be a tibble/dataframe, not a nested list/JSON-type output from an API. This does mean that, say, duplicate columns get unique names, a mismatch with the raw content from the API.

jmobrien commented 3 years ago

I can confirm from personal experience that a metadata() pointing at the new endpoint fixes most of this effectively. I rolled my own metadata2() in a project fork; that was pretty easy and really helped clear a lot of this up. But I could see folding it into the original metadata(), like the idea in #191, to help make a smoother transition.

What returns from the two endpoints is different enough that any additional functionality depending on the old endpoint, like survey_questions(), won't really work anymore. So that would need consideration as part of an integration/transition plan.

lyh970817 commented 3 years ago

I'm trying to put together a package that makes more use of the metadata to generate variable dictionaries and label survey data, as the research group I'm in is constantly sending out lengthy Qualtrics surveys and we want to automate things. Not sure how relevant it is to what you have in mind: https://github.com/lyh970817/qualtdict.

juliasilge commented 3 years ago

I believe that the main questions/problems raised in this issue have been addressed in the work done by @jmobrien in #199. You can now access column metadata/mapping via an attribute on the output of fetch_survey() or the new extract_colmap() function:

library(qualtRics)
library(tidyverse)
res <- fetch_survey("SV_5Bxxxxxxxx")
#>   |======================================================================| 100%
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   .default = col_character(),
#>   StartDate = col_datetime(format = ""),
#>   EndDate = col_datetime(format = ""),
#>   Progress = col_double(),
#>   `Duration (in seconds)` = col_double(),
#>   Finished = col_logical(),
#>   RecordedDate = col_datetime(format = ""),
#>   RecipientLastName = col_logical(),
#>   RecipientFirstName = col_logical(),
#>   RecipientEmail = col_logical(),
#>   ExternalReference = col_logical(),
#>   LocationLatitude = col_double(),
#>   LocationLongitude = col_double(),
#>   Q1007 = col_double(),
#>   Q1_DO_1 = col_double(),
#>   Q1_DO_2 = col_double(),
#>   Q1_DO_3 = col_double(),
#>   Q1_DO_4 = col_double(),
#>   Q1_DO_5 = col_double(),
#>   SolutionRevision = col_double(),
#>   FL_6_DO_FL_7 = col_double()
#>   # ... with 4 more columns
#> )
#> ℹ Use `spec()` for the full column specifications.
extract_colmap(res) %>%
  filter(sub != "")
#> # A tibble: 15 x 7
#>    qname  description          main        sub        ImportId timeZone choiceId
#>    <chr>  <chr>                <chr>       <chr>      <chr>    <chr>    <chr>   
#>  1 Q1_1   Are you aware of th… Are you aw… Distracti… QID1_1   <NA>     <NA>    
#>  2 Q1_2   Are you aware of th… Are you aw… Texture    QID1_2   <NA>     <NA>    
#>  3 Q1_3   Are you aware of th… Are you aw… Flavor     QID1_3   <NA>     <NA>    
#>  4 Q1_4   Are you aware of th… Are you aw… Color      QID1_4   <NA>     <NA>    
#>  5 Q1_5   Are you aware of th… Are you aw… Nutrition… QID1_5   <NA>     <NA>    
#>  6 Q1_DO… Are you aware of th… Are you aw… Display O… QID1_DO  <NA>     1       
#>  7 Q1_DO… Are you aware of th… Are you aw… Display O… QID1_DO  <NA>     2       
#>  8 Q1_DO… Are you aware of th… Are you aw… Display O… QID1_DO  <NA>     3       
#>  9 Q1_DO… Are you aware of th… Are you aw… Display O… QID1_DO  <NA>     4       
#> 10 Q1_DO… Are you aware of th… Are you aw… Display O… QID1_DO  <NA>     5       
#> 11 FL_6_… FL_6 - Block Random… FL_6        Block Ran… FL_6_DO  <NA>     FL_7    
#> 12 FL_6_… FL_6 - Block Random… FL_6        Block Ran… FL_6_DO  <NA>     FL_8    
#> 13 FL_6_… FL_6 - Block Random… FL_6        Block Ran… FL_6_DO  <NA>     FL_9    
#> 14 FL_6_… FL_6 - Block Random… FL_6        Block Ran… FL_6_DO  <NA>     FL_10   
#> 15 FL_6_… FL_6 - Block Random… FL_6        Block Ran… FL_6_DO  <NA>     FL_11

Created on 2021-01-10 by the reprex package (v0.3.0.9001)

In my testing, this seems to handle the use cases mentioned in this issue pretty well (matrix questions, etc). If you have further problems or questions, please open a new issue! 🙌