Closed sfirke closed 7 years ago
@JakeRuss I thought you'd appreciate the gsub
above. Blech.
The issue here is that it reads in the data accurately but then later doesn't merge correctly? This may be a dplyr bug if yes. Could you include an example?
Dplyr's okay here, I just notice the problem when "it's" doesn't join with "it's" because one of those single quotes is curled. Upon digging deeper to try for a reprex, I think it's that haven::read_sav
is bringing in the smart / curled quote as is in the "label" attr
. I don't think I've had this problem working with readxl
or readr
.
Maybe it's more of a niche problem then? I hate my SurveyMonkey -> R workflow 😢 Need to check in and see where they are at currently with RMonkey...
That's brutal. 😦
@rgknight here's a reprex you can run in your RStudio:
> x <- "it\u2019s"
> x
[1] "it’s"
> x == "it's"
[1] FALSE
Yeah.... I think this is pretty unique to your workflow...
I'm not a haven user or a SurveyMonkey user, but if I understand correctly, you are exporting some data from SurveyMonkey as an SPSS file, which is read as UTF-16 or some other non-standard encoding, and some other data from SurveyMonkey into Excel, which is exported and/or read as UTF-8, then trying to merge them, and it is failing where curly ' is downgraded to '.
I ran into some problems using Qualtrics exports where the encoding would change if I loaded it directly after exporting versus if I loaded it after having opened it in Excel. But I was working entirely with csv files, which have better encoding support through readr
than either haven
or read_xls
seem to have, so I handled it through a check_encoding
function.
We can maybe take this offline, but I think you can improve your workflow here. It sounds like you use SurveyMonkey a lot. Is there a reason you aren't using the surveymonkey api?
Looks like this can be resolved by using the csv file instead of the xls file from SurveyMonkey, becuase the xls file encoding is downgraded at some point but the csv file is not. https://help.surveymonkey.com/articles/en_US/kb/CSV-Exports
Okay, I think you are right. I'll write a personal function for now, close this ticket, and not include in janitor unless there's interest from others.
My SurveyMonkey workflow... sigh. There's the rmonkey
package but it's not quite refreshed to work with the latest API which had breaking changes. And the .sav SPSS export from SurveyMonkey has different data than the .csv or .xlsx when it comes to question names and numbers - right now I pull from both of them. Definitely not ideal.
I just spent maybe an hour trying to resolve some stupid apostrophes in SurveyData. It went SurveyMonkey -> .xlsx -> read_excel(). Then it was causing joins on that string to fail.
To me this is a quintessential janitor question in that I'm a Windows user who doesn't understand encoding types. So the little on StackOverflow about this topic was not useful to me.
I finally fixed it with:
And that converted the curly apostrophe to a simple one.
I got the list from this gist. Could those all be loaded into a function and it zaps all of the special ones to regular?
Unfortunately I'm having trouble with a reprex here, if I copy-paste the curly quote it doesn't get fixed. But it worked in my data, reproducibly.