Closed sfirke closed 3 years ago
It's a choose-one multiple choice question, with an "other" option and accompanying text box. Several people chose "other" and entered text, but only one response is presenting this way, the above one. This is baffling. The only idea I have is the trailing whitespace in that person's open response answer?
This person has a bizarre double response. Dunno how it's mucking up the other field like this, and if it's separate, but there's something else going on. They have identical repeated responses throughout.
I traced this back to the JSON pull in `get_responses`; when I look at `parsed_content$data`, the weird doubleness is already there.
The same person (respondent 5 on API, see above for ID etc.) looks normal in Excel.
Might need to talk to SurveyMonkey about this.
I could de-duplicate at the level where `x` is formed, e.g., `slice(1)` after grouping by unique questions, to trim this data.frame in half: `x %>% dplyr::select(-quiz_options) %>% get_dupes()`. The `slice(1)` might also resolve the "other" record that is the one non-true-duplicate pair out of 100 records. But this seems like a bit of a hack for a case I do not understand.
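In base-R terms, the grouped `slice(1)` idea amounts to keeping the first row per respondent/question pair. A minimal sketch, with hypothetical column names (the real data.frame's columns may differ):

```r
# Toy data.frame standing in for x; column names are hypothetical.
x <- data.frame(
  respondent_id = c("A", "A", "B"),
  question_id   = c("q1", "q1", "q1"),
  answer        = c("Other (typed text)", "Other", "Yes"),
  stringsAsFactors = FALSE
)

# Keep only the first row within each respondent/question pair,
# analogous to group_by(respondent_id, question_id) %>% slice(1).
deduped <- x[!duplicated(x[c("respondent_id", "question_id")]), ]
nrow(deduped)  # 2: respondent A's second row is dropped
```

The caveat above still applies: for the one non-true-duplicate pair, this silently keeps whichever row comes first.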
Just ran into this again with survey id `166507207`, response id `10612765589`. Debugging this through the level where I calculate `x` - can't say yet what happens upstream of that - they have a mostly double response: 49 of 50 questions are the same. The one exception, question `220028310`, is an "other" question, and for that they have two different responses.

Survey `183426325` also has two cases of a single respondent having a duplicated response to a single multiple choice Q (the two cases have different duplicated question numbers).
Idea: offer a `force_parse = FALSE` argument that performs a de-duplicating `slice()` so that the survey can parse. Have the regular `parse_survey()` call error and send users to this page to add more detail, for maybe taking this up with SurveyMonkey someday to actually resolve it at the source. But then also print in the message that they can re-run the call with `force_parse = TRUE` to implement the hacky fix above, which will - in practice for the three cases so far - yield the desired result.
Having my 🍰 and eating it too?
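A rough sketch of how that gate could look; `handle_dupes`, the column names, and the message text are all hypothetical, not the package's actual internals:

```r
# Hypothetical helper: error by default on duplicated respondent/question
# rows, or drop them when force_parse = TRUE.
handle_dupes <- function(responses, force_parse = FALSE) {
  dupes <- duplicated(responses[c("respondent_id", "question_id")])
  if (!any(dupes)) {
    return(responses)
  }
  if (!force_parse) {
    stop(
      "There are duplicated rows in the responses - please file a bug ",
      "report, or re-run with force_parse = TRUE to keep only the first ",
      "row per respondent/question."
    )
  }
  warning("Dropping ", sum(dupes), " duplicated response row(s).")
  responses[!dupes, , drop = FALSE]
}
```
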
I am running into this issue and have a few (kind of) unique situations going on:
I set up my API pull like this:
```r
# Pull survey from API
survey_fetch <- surveymonkey::fetch_survey_obj(287637453)

# Create data frame from API data
survey_preserve <- surveymonkey::parse_survey(survey_fetch)
```
The `parse_survey` function gave me the error that sent me to this issue (`Error: There are duplicated rows in the responses, maybe a situation like #27 - file a bug report at https://github.com/tntp/surveymonkey/issues`).

However, I've been able to re-run `parse_survey` to get my data with duplicated rows, and have used the following to aggregate rows on `ID`, which is my custom variable:
```r
survey <- survey_preserve %>%
  group_by(ID) %>%
  summarise(across(.cols = everything(), .fns = ~ .[!is.na(.)][1]))
```
Unfortunately I can't share this data, but where there are duplicated rows, there are differing values for two columns: `type` and `required`.
Finally, thanks for this package and the development that has gone into it. It has really improved the workflow and access to our data and insights, and has been a game changer for my team.
I have no experience with package development, but am very interested in learning, and would be more than happy to help QA any updates with this issue.
Thanks, Matt
First off, thanks for the issue and kind words. It's motivating to hear this has been a help to others!
I am not 100% confident, but in the absence of more info, my hypothesis is that there's some edge case on SurveyMonkey's side causing this problematic extra data to leak through the API. I can trace it to the JSON I get back from the API, but that's as far as I can go in investigating.
I do have my proposed de-duplication idea above, which looks like your workaround: where a respondent somehow has two values for a question, grab the first non-missing one. I could add that as an option to `parse_survey()`, still printing a warning about what's happening. If/when I implement that, maybe you can try it out on your survey.
Only other question I have for you: you mention two problematic questions; how many rows, out of how many total, have multiple responses for a single question? In my cases there would be just 1 or 2 bad respondent records out of ~100 per survey. That makes me think it's not just a matter of a certain question type, but something the respondent did, like clicking the "back" button and re-answering.
Using `get_dupes` + `tabyl` from janitor (maybe you've heard of it? 😆). Total survey responses at the time of this data pull was about 1,900.
The table below shows duplicate values based on custom variable (ID):
| dupe_count | n | percent |
| --- | --- | --- |
| 2 | 77 | 28.8% |
| 3 | 159 | 59.6% |
| 4 | 21 | 7.9% |
| 5 | 2 | 0.7% |
| 6 | 5 | 1.9% |
| 7 | 1 | 0.4% |
| 8 | 1 | 0.4% |
| 10 | 1 | 0.4% |
| **Total** | **267** | **100.0%** |
Where there are two rows per respondent, it is as you mentioned - the survey respondent clicked into the survey once and the survey ended because of logic/disqualification. Then they went back into the survey to take it again.
Here is some dput that illustrates a typical case with three rows for a respondent, which should actually count as a single valid take, but is structured across three rows for some reason.
```r
structure(list(type = c(NA, "city", "state"), required = c(NA,
TRUE, TRUE), respondent_id = c("12345678910", "12345678910",
"12345678910"), date_created = c("2020-07-17 00:00:00", "2020-07-17 00:00:00",
"2020-07-17 00:00:00"), date_modified = c("2020-07-17 11:11:11",
"2020-07-17 11:11:11", "2020-07-17 11:11:11"), id = c("abc123",
"abc123", "abc123"), do_you_choose_yes_or_no = structure(c(1L,
NA, NA), .Label = c("Yes", "No", "Unsure"), class = "factor"),
    what_is_your_choice_this_time = structure(c(1L, NA, NA), .Label = c("Yes",
    "No"), class = "factor"), x_name = c(NA_character_, NA_character_,
    NA_character_), x_company = c(NA_character_, NA_character_,
    NA_character_), x_address = c(NA_character_, NA_character_,
    NA_character_), x_address_2 = c(NA_character_, NA_character_,
    NA_character_), x_city_town = c(NA, "New York", NA), state = c(NA,
    NA, "NY"), x_zip_postal_code = c(NA_character_, NA_character_,
    NA_character_), x_country = c(NA_character_, NA_character_,
    NA_character_), x_email_address = c(NA_character_, NA_character_,
    NA_character_), x_phone_number = c(NA_character_, NA_character_,
    NA_character_), zip_code = c("10001", NA, NA), unique_identifier = c("12345",
    NA, NA)), row.names = c(NA, 3L), class = "data.frame")
```
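For what it's worth, the same first-non-missing collapse can be sketched in base R without a dplyr dependency; `collapse_rows` is a hypothetical helper name, shown on a trimmed-down version of the three-row case:

```r
# Hypothetical helper: collapse multiple rows per respondent into one,
# taking the first non-NA value in each column.
collapse_rows <- function(df, id_col = "id") {
  first_non_na <- function(x) x[!is.na(x)][1]
  pieces <- lapply(split(df, df[[id_col]]), function(grp) {
    as.data.frame(lapply(grp, first_non_na), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Trimmed-down version of the three-row dput above:
df <- data.frame(
  id          = c("abc123", "abc123", "abc123"),
  type        = c(NA, "city", "state"),
  x_city_town = c(NA, "New York", NA),
  zip_code    = c("10001", NA, NA),
  stringsAsFactors = FALSE
)
collapse_rows(df)  # one row: x_city_town "New York", zip_code "10001"
```

Note that `type` gets collapsed to its first non-missing value too, which is another argument for dropping that column before de-duplicating.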
I took my original problem cases above to SurveyMonkey, their engineering team agreed that it was a bug but also said it has likely been fixed since then (I was experiencing it on old surveys). So there may not be more to my cases, though other question types, etc. as noted in #68 could be causing this.
Thanks @sfirke!! Interesting to hear about SurveyMonkey's insight. We sometimes use surveys as forms, and oftentimes programmers will use the "contact information" question type (seen below). I am convinced that this is the culprit of the issue I've explained above, as well as #62.
I couldn't pinpoint the exact code chunk, but since the `type` and `required` fields make each row unique, I think it was causing issues with a `distinct` call somewhere.
@mattroumaya I think I'm getting the same error as you! Although, when I try and read in other surveys I get a column called "status" in the place that I used to get the columns "type" and "required"
Did you manage to find what was causing your problem?
@shamahutoto in the commit referenced above, I dropped the type and required columns so you shouldn't see them anymore when using parse_survey. Did you use those columns for a specific purpose?
@mattroumaya sorry no I don’t use those columns, that was just the one different thing that I noticed before parse_survey stopped working for me. Now it’s telling me I have duplicates.
When I save the survey as csv there seem to be no duplicates.
Survey `166507207`, response id `10612765589`, question id `220028310` has two rows in the `responses` data.frame, where choice id `1511253599` appears in both the `choice_id` column and the `other_id` column.
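Based on that description, a hypothetical reconstruction (ids copied from above, but the column layout is assumed), with the first-non-missing collapse applied to it:

```r
# Two rows for one response/question: the same choice id shows up once
# under choice_id and once under other_id (layout assumed, not confirmed).
responses <- data.frame(
  response_id = c("10612765589", "10612765589"),
  question_id = c("220028310", "220028310"),
  choice_id   = c("1511253599", NA),
  other_id    = c(NA, "1511253599"),
  stringsAsFactors = FALSE
)

# Collapse to one row by taking the first non-NA value per column.
collapsed <- as.data.frame(
  lapply(responses, function(x) x[!is.na(x)][1]),
  stringsAsFactors = FALSE
)
nrow(collapsed)  # 1
```
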