sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/
Other
1.38k stars 132 forks source link

zap stupid curly single apostrophe / "smart quotes" #91

Closed sfirke closed 7 years ago

sfirke commented 7 years ago

I just spent maybe an hour trying to resolve some stupid apostrophes in SurveyData. It went SurveyMonkey -> .xlsx -> read_excel(). Then it was causing joins on that string to fail.

To me this is a quintessential janitor question in that I'm a Windows user who doesn't understand encoding types. So the little on StackOverflow about this topic was not useful to me.

I finally fixed it with:

mutate(question_text = gsub("[\u2018\u2019\u201A\u201B\u2032\u2035]", "'", question_text))

And that converted the curly apostrophe to a simple one.

I got the list from this gist. Could those all be loaded into a function and it zaps all of the special ones to regular?

Unfortunately I'm having trouble with a reprex here, if I copy-paste the curly quote it doesn't get fixed. But it worked in my data, reproducibly.

sfirke commented 7 years ago

@JakeRuss I thought you'd appreciate the gsub above. Blech.

rgknight commented 7 years ago

The issue here is that it reads in the data accurately but then later doesn't merge correctly? This may be a dplyr bug if yes. Could you include an example?

sfirke commented 7 years ago

Dplyr's okay here, I just notice the problem when "it's" doesn't join with "it's" because one of those single quotes is curled. Upon digging deeper to try for a reprex, I think it's that haven::read_sav is bringing in the smart / curled quote as is in the "label" attr. I don't think I've had this problem working with readxl or readr.

Maybe it's more of a niche problem then? I hate my SurveyMonkey -> R workflow 😢 Need to check in and see where they are at currently with RMonkey...

JakeRuss commented 7 years ago

That's brutal. 😦

sfirke commented 7 years ago

@rgknight here's a reprex you can run in your RStudio:

> x <- "it\u2019s"
> x
[1] "it’s"
> x == "it's"
[1] FALSE
rgknight commented 7 years ago

Yeah.... I think this is pretty unique to your workflow...

I'm not a haven user or a SurveyMonkey user, but if I understand correctly, you are exporting some data from SurveyMonkey as an SPSS file, which is read as UTF-16 or some other non-standard encoding, and some other data from SurveyMonkey into Excel, which is exported and/or read as UTF-8, then trying to merge them, and it is failing where curly ' is downgraded to '.

I ran into some problems using Qualtrics exports where the encoding would change if I loaded it directly after exporting versus if I loaded it after having opened it in Excel. But I was working entirely with csv files, which have better encoding support through readr than either haven or read_xls seem to have, so I handled it through a check_encoding function.

We can maybe take this offline, but I think you can improve your workflow here. It sounds like you use SurveyMonkey a lot. Is there a reason you aren't using the surveymonkey api?

rgknight commented 7 years ago

Looks like this can be resolved by using the csv file instead of the xls file from SurveyMonkey, becuase the xls file encoding is downgraded at some point but the csv file is not. https://help.surveymonkey.com/articles/en_US/kb/CSV-Exports

sfirke commented 7 years ago

Okay, I think you are right. I'll write a personal function for now, close this ticket, and not include in janitor unless there's interest from others.

My SurveyMonkey workflow... sigh. There's the rmonkey package but it's not quite refreshed to work with the latest API which had breaking changes. And the .sav SPSS export from SurveyMonkey has different data than the .csv or .xlsx when it comes to question names and numbers - right now I pull from both of them. Definitely not ideal.