FYI
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.0.0 ggplot2_2.1.0 rtweet_0.2.6
loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 assertthat_0.1 withr_1.0.2 digest_0.6.10 grid_3.3.1 R6_2.1.3 plyr_1.8.4
[8] jsonlite_1.0 gtable_0.2.0 magrittr_1.5 scales_0.4.0 httr_1.2.1 stringi_1.1.1 reshape2_1.4.1
[15] curl_1.2 syuzhet_1.0.0 devtools_1.12.0 tools_3.3.1 stringr_1.1.0 munsell_0.4.3 colorspace_1.2-6
[22] memoise_1.0.0 openssl_0.9.4 tibble_1.2
The stringi package should have a few different ways to deal with the troublesome characters. These answers on StackExchange seem promising.
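For example (a sketch of the StackExchange-style fix, not anything built into rtweet), stringi can strip emoji and other symbol characters via an ICU character class:
library(stringi)
# hypothetical tweet text for illustration; \U0001F380 is the ribbon emoji
# that shows up garbled in the console output further down this thread
x <- "London! \U0001F380\U0001F498"
# drop characters in the Unicode "Symbol, other" category (most emoji),
# leaving accented letters and ordinary punctuation intact
stri_replace_all_regex(x, "\\p{So}", "")
# [1] "London! "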
I've been hesitant to touch the returned text because my guess is there are cool things one can do with non-ASCII characters, and I don't want people to have to reverse engineer the default functions to get access to those. I'll look into it, though, and see if I can find relatively non-intrusive ways of adding a couple of textual-analysis-friendly filters.
After I add a couple new API call functions and finish configuring some print/plot methods (I've done all the hard parts for these things, so this shouldn't take me too long), the next step will be bulking up the documentation until I'm blue in the face. I will make sure to reference other useful packages and functions.
I had tried the iconv() function without any joy and have also now played around with a few stringi functions without luck. Encoding is one of my blind spots (especially on a Friday p.m.), so any help you can provide would be welcome.
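For reference, a typical iconv() attempt looks like this (a sketch; //TRANSLIT support depends on the platform's iconv implementation, which may itself be part of the Windows trouble):
# transliterate to ASCII where possible, dropping anything unconvertible
iconv(dt$text, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")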
Did you not find an issue with your data when doing the sentiment analyses for the vignette? If it is linked to emojis (which, as you say, might be a fruitful source of analyses), I'm surprised none occurred in 10,000 tweets.
It's definitely something with your settings. Mine seems to deal with emojis/Unicode just fine; I thought maybe you had run into a quirk in the sentiment analysis package.
For now I'd keep looking into stringi/stringr functions: https://www.r-bloggers.com/icu-unicode-text-transforms-in-the-r-package-stringi/. I'll try to look into this more when I can as well.
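The ICU transforms in that post boil down to stri_trans_general(); a minimal sketch:
library(stringi)
# "Latin-ASCII" transliterates accented Latin characters to plain ASCII;
# note it leaves emoji untouched, so it addresses a different subset of
# the troublesome characters than outright removal does
stri_trans_general("café", "Latin-ASCII")
# [1] "cafe"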
I think I have hit something similar before - down to Windows and locale?
I saved dt$text to a file test.csv
This was one that appeared to be causing issues
dt$text[18] #[1] "@quominus I'm pretty sure we've had this discussion but London! \xf0\u009f\u008e\u0080\xf0\u009f\u0092\u0098\xf0\u009f\u0092\u0096\xf0\u009f\u008c�\xf0\u009f\u008c�"
head(dt$text,3)
[1] "RT @T64Pamela: @quominus @MethadoneBaby Potus just gave a commandment to a private business to cease and desist, against a court ruling. Ma…"
[2] "RT @T64Pamela: @quominus @MethadoneBaby Obama overruled a judge's ruling. This is unlawful regardless of the agency. This is the act of a d…"
[3] "@quominus @SyracuseU or I guess just mindful* communication"
eval(parse("test.csv", encoding="UTF-8"))
Error in parse("test.csv", encoding = "UTF-8") :
test.csv:3:16: unexpected '@'
2: "RT @T64Pamela: @quominus @MethadoneBaby Potus just gave a commandment to a private business to cease and desist, against a court ruling. Ma…"
3: RT @T64Pamela: @
^
So I never even reached the problem tweet with that line of code.
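Worth noting that parse() expects R source code, so it will choke on raw tweet text regardless of encoding. A plain read with an explicit encoding is probably a better round-trip test (a sketch using readr, which the sessionInfo above shows attached):
library(readr)
# read the file back as data, not as R code, forcing UTF-8
txt <- read_csv("test.csv", locale = locale(encoding = "UTF-8"))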
I set the encoding explicitly to UTF-8 when parsing the JSON object. Hopefully that fixes the problem. Let me know either way!
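Roughly, the fix amounts to something like this (a sketch, not the actual rtweet diff; httr and jsonlite appear in the sessionInfo above, and url stands in for a Twitter API endpoint):
r <- httr::GET(url)
# force UTF-8 when extracting the response body, before JSON parsing
json <- httr::content(r, as = "text", encoding = "UTF-8")
out <- jsonlite::fromJSON(json)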
Doesn't look like it :( Upgraded to 0.2.6
dt$text[19]
[1] "Recalling standardized test experiences: During the PSAT I threw up the blueberry pancakes my mom made me for breakfast. \xf0\u009f\u0098\u0082 @quominus"
Encoding(dt$text[19])
[1] "UTF-8" # I think before it was "unknown"
syuzhet::get_nrc_sentiment(dt$text)
Error in tolower(char_v) :
invalid input 'Recalling standardized test experiences: During the PSAT I threw up the blueberry pancakes my mom made me for breakfast. 😂 @quominus' in 'utf8towcs'
Okay, I think I figured it out. I was able to replicate the error, and then was able to fix it. There's now a clean_tweets function that deals with the error.
# if this produces error
tolower(data$text)
# clean tweets
data$text <- clean_tweets(data$text)
# no error
tolower(data$text)
Until I understand the full implications of the conversion to ASCII in these situations, I've not made it the default (out of fear that it'll be stripping otherwise valuable information).
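For anyone stuck on an older version, a rough stand-in with the same effect (an assumption about the behaviour, not the actual clean_tweets source):
# strip anything that won't survive conversion to ASCII
data$text <- iconv(data$text, from = "UTF-8", to = "ASCII", sub = "")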
Thanks. Had a brief look and it seems to solve the issue. For the one set I have looked at, tolower() was sufficient.
Working through the vignette - presumably a different set of tweets.
Is there an easy way to exclude problem tweets? Not that I am trying to exclude tweets from black women LOL!
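One blunt way to exclude rather than clean (a sketch; pure-ASCII is only a proxy for "problem" tweets):
library(stringi)
# keep only tweets made up entirely of ASCII characters,
# which by definition can't trip up utf8towcs
dt <- dt[stri_enc_isascii(dt$text), ]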