Weird case of an other breaking parse_survey

sfirke commented 5 years ago

Survey 166507207, response id 10612765589, question id 220028310 has two rows in the responses data.frame, where choice id 1511253599 appears in both the choice_id column and the other_id column.

sfirke commented 5 years ago

It's a choose-one multiple choice, with an "other" option and accompanying text box. Several people entered "other" text and a choice, but only one response, the one above, is presenting this way. This is baffling. The only idea I have is the trailing whitespace in that person's open-response answer?

sfirke commented 5 years ago

This person has a bizarre double response. Dunno how it's mucking up the other field like this, or whether that's a separate problem, but there's something else going on. They have identical repeated responses throughout.

I traced this back to the json pull in get_responses and when I look at parsed_content$data the weird doubleness is already there.

The same person (respondent 5 on API, see above for ID etc.) looks normal in Excel.

Might need to talk to SurveyMonkey about this.

I could de-duplicate at the level where x is formed, e.g., slice(1) after grouping by unique questions, to trim this data.frame in half. The duplicated rows can be seen with:

x %>% dplyr::select(-quiz_options) %>% get_dupes()

The slice(1) might resolve the "other" record that is the one non-true-duplicate pair out of 100 records, too. But this seems like a bit of a hack for a case I do not understand.
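A minimal sketch of the slice(1) idea, using a toy stand-in for x. The column names response_id, question_id and choice_id are assumptions made for illustration; the real x inside parse_survey has more columns and may need different grouping keys.

```r
library(dplyr)

# Toy stand-in for `x`: one respondent whose answers came back twice.
# Column names are hypothetical, not the package's actual internals.
x <- data.frame(
  response_id = c("r1", "r1", "r1", "r1"),
  question_id = c("q1", "q1", "q2", "q2"),
  choice_id   = c("a", "a", "b", "c"),  # q2 is a non-identical pair
  stringsAsFactors = FALSE
)

# Keep only the first row for each respondent/question combination.
x_dedup <- x %>%
  group_by(response_id, question_id) %>%
  slice(1) %>%
  ungroup()
```

Note that for the non-identical pair this silently keeps whichever row comes first, which is why it is a workaround rather than a fix.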

sfirke commented 5 years ago

Just ran into this again with survey id 166507207, response id 10612765589. Debugging this through the level where I calculate x - can't say yet what happens upstream of that - they have a mostly double response, 49 of 50 questions the same. A single question 220028310 is an "other" question and for that, they have two different responses.

sfirke commented 4 years ago

Survey 183426325 also has two cases of a single respondent having a duplicated response to a single multiple choice Q (the two cases have different duplicated question numbers)

sfirke commented 4 years ago

idea: offer a force_parse = FALSE argument that performs a de-duplicating slice() so that the survey can parse.

Have the regular parse_survey() call error and send them to this page to add more detail, for maybe taking this up with SurveyMonkey someday to actually resolve it at the source. But then also print in the message that they can re-run the call with force_parse = TRUE to implement the hacky fix above, which will - in practice for the three cases so far - yield the desired result.
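A rough sketch of how that error-then-opt-in behaviour could look. The helper name dedupe_or_stop, the message wording, and the column names are all hypothetical; this is not code from the surveymonkey package.

```r
library(dplyr)

# Hypothetical helper: stop on duplicated respondent/question rows unless
# force_parse = TRUE, in which case warn and keep the first row of each.
dedupe_or_stop <- function(x, force_parse = FALSE) {
  n_dupes <- x %>%
    count(response_id, question_id) %>%
    filter(n > 1) %>%
    nrow()

  if (n_dupes == 0) {
    return(x)
  }
  if (!force_parse) {
    stop(
      "There are duplicated rows in the responses. ",
      "Re-run with force_parse = TRUE to keep the first row per question.",
      call. = FALSE
    )
  }
  warning("Dropping duplicated rows; keeping the first row per question.",
          call. = FALSE)
  x %>%
    group_by(response_id, question_id) %>%
    slice(1) %>%
    ungroup()
}

# Toy input with one duplicated question for one respondent.
x <- data.frame(
  response_id = c("r1", "r1", "r1"),
  question_id = c("q1", "q2", "q2"),
  choice_id   = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Opting in returns the de-duplicated rows (with a warning).
x_forced <- dedupe_or_stop(x, force_parse = TRUE)
```

Calling dedupe_or_stop(x) without force_parse = TRUE stops with the error message instead.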

Having my 🍰 and eating it too?

mattroumaya commented 4 years ago

I am running into this issue and have a few (kind of) unique situations going on:

Using custom variables to identify survey respondents
Using SurveyMonkey's 'contact form' fields for City/Town and State, and then using two text boxes for Zip Code and another unique identifier that survey respondents are required to fill in.

I set up my API pull like this:

# Pull survey from API
survey_fetch <- surveymonkey::fetch_survey_obj(287637453)

# Create dataframe from API data
survey_preserve <- surveymonkey::parse_survey(survey_fetch)

The parse_survey function gave me the error that sent me to this issue (Error: There are duplicated rows in the responses, maybe a situation like #27 - file a bug report at https://github.com/tntp/surveymonkey/issues)

However, I've been able to re-run parse_survey to get my data with duplicated rows, and have used the following to aggregate rows on id, which is my custom variable:

library(dplyr)

# Collapse to one row per ID, keeping the first non-missing value in each column
survey <- survey_preserve %>%
  group_by(ID) %>%
  summarise(across(.cols = everything(), .fns = ~ .[!is.na(.)][1]))

Unfortunately can't share this data, but where there are duplicated rows, there are differing values for two columns: type and required

In my dataset, type contains the values NA, city, or state. I believe this is related to the 'contact form' field mentioned above.
required contains the values NA or TRUE. If type == city or type == state, required == TRUE. Otherwise, type and required are NA.

Finally, thanks for this package and the development that has gone into it. It has really improved the workflow and access to our data and insights, and has been a game changer for my team.

I have no experience with package development, but am very interested in learning, and would be more than happy to help QA any updates with this issue.

Thanks, Matt

sfirke commented 4 years ago

First off, thanks for the issue and kind words. It's motivating to hear this has been a help to others!

I am not 100% confident, but in the absence of more info, my hypothesis is that there's some edge case on SurveyMonkey's side that is causing this problematic extra data to leak through the API. I can trace it to the JSON I get back from the API, but that's as far as I can go in investigating.

I do have my proposed ideas for de-duplication above, which look like your workaround: where a respondent somehow has two values for a question, grab the first non-missing one. I could add that as an option to parse_survey(), still printing a warning about what's happening. If/when I implement that, maybe you can try it out on your survey.

sfirke commented 4 years ago

Only other question I have for you: you mention two problematic questions, how many rows out of how many have multiple responses for a single question? In my cases there would just be 1 or 2 bad respondent records out of ~100 per survey. Which makes me think it's not just a matter of a certain question type, but something that the respondent did, like clicking the "back" button and re-answering.

mattroumaya commented 4 years ago

Using get_dupes + tabyl from janitor (maybe you've heard of it? 😆). Total survey responses at the time of this data pull was about 1900.

The table below shows duplicate values based on custom variable (ID):

dupe_count   n percent
          2  77   28.8%
          3 159   59.6%
          4  21    7.9%
          5   2    0.7%
          6   5    1.9%
          7   1    0.4%
          8   1    0.4%
         10   1    0.4%
      Total 267  100.0%
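For reference, a small sketch of how a table like this can be built with janitor. The toy data and the choice to count respondents (one row per ID) rather than raw rows are assumptions; the thread does not say which was used for the table above.

```r
library(dplyr)
library(janitor)

# Toy data: `ID` stands in for the custom variable.
survey <- data.frame(
  ID     = c("a", "a", "b", "b", "b", "c"),
  answer = c("Yes", NA, "No", NA, NA, "Yes"),
  stringsAsFactors = FALSE
)

dupe_table <- survey %>%
  get_dupes(ID) %>%             # rows whose ID occurs more than once, plus dupe_count
  distinct(ID, dupe_count) %>%  # one row per respondent
  tabyl(dupe_count) %>%         # frequency table of dupe_count
  adorn_totals("row") %>%
  adorn_pct_formatting()
```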

Where there are two rows per respondent, it is as you mentioned - the survey respondent clicked into the survey once and the survey ended because of logic/disqualification. Then, they went back into the survey to take it again.

Here is some dput that illustrates a typical case with three rows for a respondent, which should actually count as a single valid take, but is structured across three rows for some reason.

The columns starting with x are from SurveyMonkey's contact field question.
The additional zip code variable was created using a text box in order to force a 5-digit numeric response.
All questions in the survey were required

structure(list(type = c(NA, "city", "state"), required = c(NA, 
TRUE, TRUE), respondent_id = c("12345678910", "12345678910", 
"12345678910"), date_created = c("2020-07-17 00:00:00", "2020-07-17 00:00:00", 
"2020-07-17 00:00:00"), date_modified = c("2020-07-17 11:11:11", 
"2020-07-17 11:11:11", "2020-07-17 11:11:11"), id = c("abc123", 
"abc123", "abc123"), do_you_choose_yes_or_no = structure(c(1L, 
NA, NA), .Label = c("Yes", "No", "Unsure"), class = "factor"), 
    what_is_your_choice_this_time = structure(c(1L, NA, NA), .Label = c("Yes", 
    "No"), class = "factor"), x_name = c(NA_character_, NA_character_, 
    NA_character_), x_company = c(NA_character_, NA_character_, 
    NA_character_), x_address = c(NA_character_, NA_character_, 
    NA_character_), x_address_2 = c(NA_character_, NA_character_, 
    NA_character_), x_city_town = c(NA, "New York", NA), state = c(NA, 
    NA, "NY"), x_zip_postal_code = c(NA_character_, NA_character_, 
    NA_character_), x_country = c(NA_character_, NA_character_, 
    NA_character_), x_email_address = c(NA_character_, NA_character_, 
    NA_character_), x_phone_number = c(NA_character_, NA_character_, 
    NA_character_), zip_code = c("10001", NA, NA), unique_identifier = c("12345", 
    NA, NA)), row.names = c(NA, 3L), class = "data.frame")
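As an illustration, the first-non-missing aggregation can be applied to a trimmed-down copy of this example. Only a subset of the columns is reproduced, and dropping type and required before collapsing is a choice made for this sketch.

```r
library(dplyr)

# Subset of the three-row respondent shown above.
resp <- data.frame(
  type        = c(NA, "city", "state"),
  required    = c(NA, TRUE, TRUE),
  id          = c("abc123", "abc123", "abc123"),
  x_city_town = c(NA, "New York", NA),
  state       = c(NA, NA, "NY"),
  zip_code    = c("10001", NA, NA),
  stringsAsFactors = FALSE
)

# Drop the two columns that differ between rows, then take the first
# non-missing value of every remaining column within each respondent.
collapsed <- resp %>%
  select(-type, -required) %>%
  group_by(id) %>%
  summarise(across(.cols = everything(), .fns = ~ .[!is.na(.)][1]))
```

The three rows collapse to a single row per id, with the city, state and zip code values that were spread across rows now side by side.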

sfirke commented 3 years ago

I took my original problem cases above to SurveyMonkey. Their engineering team agreed that it was a bug but also said it has likely been fixed since then (I was experiencing it on old surveys). So there may not be more to my cases, though other question types, etc. as noted in #68 could be causing this.

mattroumaya commented 3 years ago

Thanks @sfirke!! Interesting to hear about SurveyMonkey's insight. We sometimes use surveys as forms, and oftentimes programmers will use the "contact information" question type (seen below). I am convinced that this is the culprit of the issue I've explained above, as well as #62.

I couldn't pinpoint the exact code chunk, but since the type and required fields make each row unique, I think it was causing issues with a distinct call somewhere.

shamahutoto commented 3 years ago

@mattroumaya I think I'm getting the same error as you! Although, when I try and read in other surveys I get a column called "status" in the place that I used to get the columns "type" and "required"

Did you manage to find what was causing your problem?

mattroumaya commented 3 years ago

@shamahutoto in the commit referenced above, I dropped the type and required columns so you shouldn't see them anymore when using parse_survey. Did you use those columns for a specific purpose?

shamahutoto commented 3 years ago

@mattroumaya sorry no I don’t use those columns, that was just the one different thing that I noticed before parse_survey stopped working for me. Now it’s telling me I have duplicates.

When I save the survey as csv there seem to be no duplicates.
