Multiple responses with breakout_sets = FALSE downloading as numerics

kevintroy commented 3 years ago

I don't often fetch_survey(breakout_sets = FALSE), but when I do, the multiple response questions download as "pure" strings of numbers without delimiters, which then get read into the tibble as numeric data. For example, if both the second and fifth options are checked, the data will show the number 25.

By contrast, the Qualtrics GUI "export data" will return a CSV with comma-delimited strings, which is what I was expecting. Similarly, any "Display Order" variables pulled down by fetch_survey(breakout_sets=FALSE) come down as pipe-delimited strings of numbers, e.g. ("3|1|2").

I have a feeling this is (another) case of inconsistent Qualtrics APIs, but thought I'd bring it up. Possibly related: https://github.com/ropensci/qualtRics/issues/144

juliasilge commented 3 years ago

I just tried out various configurations with breakout_sets = FALSE and I can't seem to find this problem with my example surveys. Can you add a bit more detail and maybe an example if possible, so we can see if we can handle this better?

kevintroy commented 3 years ago

It occurred to me that this might vary by question type -- the question in the data set I was looking at originally was a Multiple Answer Grid, which is not quite the same as a Multiple Choice Multiple Response question.

So, I've created a test survey that uses every Qualtrics question type that should be affected by breakout_sets. I'll populate it with some data and post examples of what I'm seeing in a few days.

kevintroy commented 3 years ago

I've delved into this further and it's because of this combination of things:

1. When breakout_sets = FALSE and label = FALSE, the Qualtrics API will return comma-delimited strings of numbers in a single column (for example, if the first, second, and third options are checked off, the API will return "1,2,3").
2. readr::read_csv ignores these commas, for what I'm sure are reasons.

We can see the readr behavior without making any API calls:

library(tidyverse)

test_frame <- tibble(numeric = c(1, 2, 3), 
                     comma_delimited = c("1,2,3", "3,2,1", "2,1,3"), 
                     semi_delimited = c("1;2;3", "3;2;1", "2;1;3"), 
                     pipe_delimited = c("1|2|3", "3|2|1", "2|1|3"))

test_frame %>% 
  write_csv("test.csv")

read_csv("test.csv")
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   numeric = col_double(),
#>   comma_delimited = col_number(),
#>   semi_delimited = col_character(),
#>   pipe_delimited = col_character()
#> )
#> # A tibble: 3 x 4
#>   numeric comma_delimited semi_delimited pipe_delimited
#>     <dbl>           <dbl> <chr>          <chr>         
#> 1       1             123 1;2;3          1|2|3         
#> 2       2             321 3;2;1          3|2|1         
#> 3       3             213 2;1;3          2|1|3
Created on 2021-04-12 by the reprex package (v1.0.0)

Note that the comma_delimited column is read in as a double.
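
The likely reason is that readr's default locale treats the comma as a grouping mark (a thousands separator), so the number parser drops it. Here is a minimal sketch of that behavior using parse_number(), assuming the default locale is in effect:

```r
library(readr)

# The default locale uses "," as the grouping mark.
default_locale()$grouping_mark

# Grouping marks are dropped when parsing numbers, so the three
# choice codes collapse into a single value.
collapsed <- parse_number("1,2,3")
collapsed

# Parsing the same kind of string as character keeps the delimiters intact.
kept <- parse_character("1;2;3")
kept
```

This would explain why only the comma-delimited column is affected, while the semicolon- and pipe-delimited columns stay character.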

One possible "fix" would be for fetch_survey to generate a warning when both breakout_sets = FALSE and label = FALSE. The user could then use a col_types specification to make sure the delimited column is read in correctly.
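
As a rough sketch of what that col_types workaround could look like on the readr side (no API call; the file is a temporary copy of the test data above, and splitting with strsplit() is just one option):

```r
library(readr)

# Recreate part of the example file in a temporary location.
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    numeric = c(1, 2, 3),
    comma_delimited = c("1,2,3", "3,2,1", "2,1,3"),
    stringsAsFactors = FALSE
  ),
  path
)

# Force the multiple-response column to be read as character
# instead of letting readr guess its type.
fixed <- read_csv(
  path,
  col_types = cols(
    numeric = col_double(),
    comma_delimited = col_character()
  )
)

# Split each response into its individual choice codes.
choices <- strsplit(fixed$comma_delimited, ",", fixed = TRUE)
choices
```

With the column kept as character, the selected options can be recovered per respondent rather than being collapsed into one number.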

juliasilge commented 3 years ago

Oh boy, separating those values by COMMAS in a COMMA-SEPARATED file 😩

Am I correct in understanding that it isn't all combinations of breakout_sets = FALSE and label = FALSE that are problematic, but just for certain question types? Or is it all question types for this combination?

kevintroy commented 3 years ago

I checked all the question types where breakout_sets is relevant, and the format/behavior is consistent across them all.

juliasilge commented 3 years ago

Here is the new warning for when folks use breakout_sets = FALSE together with label = FALSE:

library(qualtRics)
fetch_survey("SV_5BJRo2RGHajIlOB", 
             label = FALSE, 
             breakout_sets = FALSE, 
             convert = FALSE,
             force_request = TRUE)
#> Warning: Use caution with `breakout_sets = FALSE` plus `label = FALSE`
#> * Results will likely be incorrectly guessed and read in as numeric
#> * Use a `col_types` specification to override
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   .default = col_double(),
#>   StartDate = col_datetime(format = ""),
#>   EndDate = col_datetime(format = ""),
#>   IPAddress = col_character(),
#>   RecordedDate = col_datetime(format = ""),
#>   ResponseId = col_character(),
#>   RecipientLastName = col_logical(),
#>   RecipientFirstName = col_logical(),
#>   RecipientEmail = col_logical(),
#>   ExternalReference = col_logical(),
#>   DistributionChannel = col_character(),
#>   UserLanguage = col_character(),
#>   Q1_DO = col_character(),
#>   FL_6_DO = col_character()
#> )
#> ℹ Use `spec()` for the full column specifications.
#> # A tibble: 122 x 38
#>    StartDate           EndDate             Status IPAddress Progress
#>    <dttm>              <dttm>               <dbl> <chr>        <dbl>
#>  1 2020-03-29 20:47:24 2020-03-29 20:48:23      1 <NA>           100
#>  2 2020-03-29 20:50:02 2020-03-29 20:50:02      2 <NA>           100
#>  3 2020-03-29 20:50:02 2020-03-29 20:50:02      2 <NA>           100
#>  4 2020-03-29 20:50:02 2020-03-29 20:50:02      2 <NA>           100
#>  5 2020-03-29 20:50:03 2020-03-29 20:50:03      2 <NA>           100
#>  6 2020-03-29 20:50:03 2020-03-29 20:50:03      2 <NA>           100
#>  7 2020-03-29 20:50:03 2020-03-29 20:50:03      2 <NA>           100
#>  8 2020-03-29 20:50:03 2020-03-29 20:50:03      2 <NA>           100
#>  9 2020-03-29 20:50:03 2020-03-29 20:50:03      2 <NA>           100
#> 10 2020-03-29 20:50:03 2020-03-29 20:50:03      2 <NA>           100
#> # … with 112 more rows, and 33 more variables: Duration (in seconds) <dbl>,
#> #   Finished <dbl>, RecordedDate <dttm>, ResponseId <chr>,
#> #   RecipientLastName <lgl>, RecipientFirstName <lgl>, RecipientEmail <lgl>,
#> #   ExternalReference <lgl>, LocationLatitude <dbl>, LocationLongitude <dbl>,
#> #   DistributionChannel <chr>, UserLanguage <chr>, Q1002 <dbl>, Q1006 <dbl>,
#> #   Q1007 <dbl>, Q1_1 <dbl>, Q1_2 <dbl>, Q1_3 <dbl>, Q1_4 <dbl>, Q1_5 <dbl>,
#> #   Q1_DO <chr>, Q200 <dbl>, Q300 <dbl>, Q201 <dbl>, Q301 <dbl>, Q202 <dbl>,
#> #   Q302 <dbl>, Q203 <dbl>, Q303 <dbl>, Q204 <dbl>, Q304 <dbl>,
#> #   SolutionRevision <dbl>, FL_6_DO <chr>

Created on 2021-04-21 by the reprex package (v2.0.0)

ropensci / qualtRics

Multiple responses with breakout_sets = FALSE downloading as numerics #210