ropensci / qualtRics

Download ⬇️ Qualtrics survey data directly into R!
https://docs.ropensci.org/qualtRics

Missing labels for updated response options in fetch_survey() #278

Open · context-dependent opened 2 years ago

context-dependent commented 2 years ago

Problem

I've recently found several instances of this behaviour. I'm not sure exactly what's going on, but I have some guesses. Below, I compare example_responses_Rics <- fetch_survey() with example_responses_csv <- read_csv() to illustrate.

In the survey, the question labeled satis_overall began with a set of response options that were later very slightly altered. The Rics export fails to parse these response options, though the data are present in the untreated csv export.

Starting with the following:

library(tidyverse)
library(qualtRics)

example_responses_Rics <- fetch_survey("SURVEY_ID")
example_responses_Rics |> count(satis_overall)

Counting the column produces

# A tibble: 4 x 2
  satis_overall                          n
  <ord>                              <int>
1 Neither satisfied nor dissatisfied     8
2 Very satisfied                       119
3 NA                                    11
4 NA                                    43

While doing the same for the csv export

example_responses_csv <- read_csv("path/to/csv_download.csv")
example_responses_csv |> count(satis_overall)

Shows the updated response options, but as unordered character values.

# A tibble: 6 x 2
  satis_overall                          n
  <chr>                              <int>
1 Neither satisfied nor dissatisfied     8
2 Somewhat dissatisfied                  3
3 Somewhat satisfied                    36
4 Very dissatisfied                      4
5 Very satisfied                       119
6 NA                                    11

In the Rics export, the column is a factor, but its levels are outdated: several retain a trailing space that has since been deleted from the live survey. (Presumably the 43 unlabelled NAs above are the 3 + 36 + 4 responses that fall under the edited labels.)

levels(example_responses_Rics$satis_overall)
# [1] "Very dissatisfied "
# [2] "Somewhat dissatisfied "
# [3] "Neither satisfied nor dissatisfied"
# [4] "Somewhat satisfied "
# [5] "Very satisfied"
# [6] NA
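For what it's worth, base R's factor() silently maps any value that doesn't exactly match a level to NA, which would explain the silent failure. A minimal reproduction with made-up values (the trailing space in the stale level is the assumption here):

# Hypothetical reproduction: a value that differs from its level only by
# a trailing space fails to match and becomes NA.
responses <- c("Somewhat satisfied", "Very satisfied")
stale_levels <- c("Somewhat satisfied ", "Very satisfied")  # note trailing space
factor(responses, levels = stale_levels, ordered = TRUE)
# [1] <NA>           Very satisfied
# Levels: Somewhat satisfied  < Very satisfied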

Thoughts on solution

fetch_survey() uses Qualtrics's metadata to parse factors. The response-options metadata appears not to update with changes to the survey, so I don't see a clear path to solving this problem while maintaining both the factor-parsing functionality and the fast metadata + csv approach.

It feels like it probably shouldn't fail silently, since in many use cases the qualtRics user and the survey maintainer are not the same person.

The .sav export is bigger than the csv, but it bakes in updated labels that can be converted into factor levels. To me, this adds support for parameterizing the export format in fetch_survey(), though I'm not sure where the zeitgeist is on that concept.
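For illustration, a rough sketch of that .sav route using haven (the file path is hypothetical):

# Sketch of the .sav approach: haven reads the value labels baked into
# the SPSS export, and as_factor() turns them into factor levels that
# reflect the current survey wording.
library(haven)
library(dplyr)

example_responses_sav <- read_sav("path/to/sav_download.sav")
example_responses_sav |>
  mutate(satis_overall = as_factor(satis_overall)) |>
  count(satis_overall)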

jmobrien commented 2 years ago

I'm assuming you were using convert = TRUE (as is the default?). And when you say this:

though the data are present in the untreated csv export.

Are you referring to the manual download from the webpage, or something else?

Also, did the changes to the question satis_overall occur before or after the survey was initially published (whether or not it had responses yet)?

I'm thinking this could relate to several things, but one main underlying problem is that convert = TRUE is still looking at the older survey-description endpoint, which may be the reason you're experiencing issues.

@juliasilge, I think this is pointing again to the need for the work around #267.

context-dependent commented 2 years ago

@jmobrien this all adds up to me.

The changes to satis_overall happened after publication, and the factor levels seem to point in the direction of the description endpoint.

I was referring to the csv exported from the GUI. I just tried fetch_survey with convert = FALSE and the result is the same as reading in the csv.
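For reference, the unconverted call looked like this (same survey ID placeholder as above):

# Pull responses without factor conversion; choice text comes back as
# character columns, matching the GUI csv download.
example_responses_raw <- fetch_survey("SURVEY_ID", convert = FALSE)
example_responses_raw |> count(satis_overall)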

jmobrien commented 2 years ago

@juliasilge what would you think about making convert = FALSE the default, at least until we can get things off the v2 endpoint? I know conversion has been the default behavior for a long time, but I'm concerned we might be steering users, maybe esp. new ones, towards potentially problematic behavior.

As @context-dependent noted, that change would just mean that users by default get the same data they get from a web download with default settings (still with some bonuses like cleaning up the metadata row & a column map)

context-dependent commented 2 years ago

@jmobrien changing the default behaviour to convert = FALSE is an elegant and straightforward solution. It could break extant data pipelines, but I'm not sure whether that's a dealbreaker.

The short-term options, imo, are:

1. Set the default to convert = FALSE

As you suggest, this would give new users data that meets their expectations by default, but may break code that currently relies on the present default behaviour.

2. Detect and handle conversion parsing failures

This would be a bigger lift, but I think still manageable in the short term. The crux is getting wrapper_mc() to detect parsing failures. Once that is done, it could raise a warning, skip conversion, change its behaviour, or some combination of the above.

Interested in thoughts, happy to help however I can.

juliasilge commented 2 years ago

I'm hesitant to change the default behavior for something that has worked the same way for such a long time (predating my own involvement in this package); that is a type of change that you can't really inform users about, apart from a message every single time they use the function (folks hate that). I think it likely would be better to change over to the new endpoint instead, for example supporting the includeLabelColumns option. We can open an issue to track input on changing to convert = FALSE but I think it would be better to update the underlying API call.

jmobrien commented 2 years ago

@context-dependent Thanks for your offer of help. It feels like the real fixes here are intertwined with a lot of other factors, so I'd be reluctant to have you commit much work to something that's likely to get changed.

But, if you're up for it, adding some kind of check + warning for parsing failures might be a great stopgap measure. Basically, if wrapper_mc() detects elements in some variable that aren't in the list of expected levels obtained from the survey_questions() call, it would throw a warning reporting the problem variable(s) and any unmatched elements.

(@juliasilge thoughts? Along with the warning, should we also tack any unmatched elements onto the vector of factor levels? Simple warnings are an easy way to preserve existing behavior for now, but basic users would have a hard time fixing anything after a warning, because all non-matched values are now NAs. Appending the levels means that users would need to rearrange them, but that's at least possible without going back to the web or redownloading.)
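To make that concrete, here's a rough sketch of the check (the helper name and exact wording are hypothetical, not existing qualtRics internals):

# Hypothetical helper sketching the proposed behaviour: warn about values
# not present in the expected levels, then append them as trailing levels
# so nothing is silently converted to NA.
warn_unmatched_levels <- function(x, expected_levels, var_name) {
  unmatched <- setdiff(unique(x[!is.na(x)]), expected_levels)
  if (length(unmatched) > 0) {
    rlang::warn(
      paste0(
        "Values in `", var_name, "` not found in survey metadata: ",
        paste0("'", unmatched, "'", collapse = ", "),
        ". Appending them as trailing factor levels."
      )
    )
  }
  factor(x, levels = c(expected_levels, unmatched), ordered = TRUE)
}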

juliasilge commented 2 years ago

That sounds like a reasonable stopgap measure to me, and I would probably lean toward doing a simple warning to start with (with a recommendation to redownload with convert = FALSE?). We could extend that to some fixing/handling next.

jmobrien commented 2 years ago

That's probably a good balance, yes.

@context-dependent is this something you'd like to tackle? If so, great, and let us know if you have any questions. (If useful, you can see several recent use examples for the rlang::warn() warning tool in /R/Checks.R.)

context-dependent commented 2 years ago

Thanks! I would be happy to give it a shot next week if that works.