Problem with sentiment analysis

pssguy commented 8 years ago

Working through the vignette, presumably with a different set of tweets:

> sa_trump <- syuzhet::get_nrc_sentiment(dt$text)
Error in tolower(char_v) : 
  invalid input 'RT @CharMckenney: I am a black woman, educated (BA,MBA), independent thinker. I support Trump. ðŸ™‹ðŸ½@realDonaldTrump @Women4Trump #Trump2016 #â€¦' in 'utf8towcs'

Is there an easy way to exclude problem tweets? Not that I am trying to exclude tweets from black women LOL!

pssguy commented 8 years ago

FYI

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.0.0   ggplot2_2.1.0 rtweet_0.2.6 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7      assertthat_0.1   withr_1.0.2      digest_0.6.10    grid_3.3.1       R6_2.1.3         plyr_1.8.4      
 [8] jsonlite_1.0     gtable_0.2.0     magrittr_1.5     scales_0.4.0     httr_1.2.1       stringi_1.1.1    reshape2_1.4.1  
[15] curl_1.2         syuzhet_1.0.0    devtools_1.12.0  tools_3.3.1      stringr_1.1.0    munsell_0.4.3    colorspace_1.2-6
[22] memoise_1.0.0    openssl_0.9.4    tibble_1.2

mkearney commented 8 years ago

The stringi package should have a few different ways to deal with the trouble characters. These answers on StackExchange seem promising.

I've been hesitant to touch the returned text b/c my guess is there are cool things one can do with non-ASCII characters, and I don't want people to have to reverse engineer the default functions to get access to those. I'll look into it, though, and see if I can find relatively non-intrusive ways of adding a couple of textual-analysis-friendly filters.
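
One rough way to set aside the problem tweets in the meantime, sketched in base R rather than stringi. The helper name can_lower and the toy data are assumptions made up for illustration; they are not part of rtweet or syuzhet. The idea is simply to test each string with tolower(), which is the call that fails in the error above, and keep only the strings that pass.

```r
# Hypothetical helper (not part of rtweet or syuzhet): returns TRUE for each
# string that tolower() can process without raising an error.
can_lower <- function(x) {
  vapply(
    x,
    function(s) tryCatch({
      tolower(s)
      TRUE
    }, error = function(e) FALSE),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Toy data standing in for dt$text. The second element is deliberately marked
# as UTF-8 while containing a byte sequence that is not valid UTF-8.
bad <- "caf\xe9 time"
Encoding(bad) <- "UTF-8"
tweets <- c("I support this", bad, "Plain ASCII tweet")

ok <- can_lower(tweets)
clean <- tweets[ok]  # only these would be passed on to the sentiment function
```

This drops whole tweets instead of repairing them, so it is only a stopgap; which(!ok) gives the positions of the tweets that were set aside.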

mkearney commented 8 years ago

After I add a couple new API call functions and finish configuring some print/plot methods (I've done all the hard parts for these things, so this shouldn't take me too long), the next step will be bulking up the documentation until I'm blue in the face. I will make sure to reference other useful packages and functions.

pssguy commented 8 years ago

I had tried the iconv() function without any joy and have also now played around with a few stringi functions without luck. Encoding is one of my blind spots (especially Friday p.m.), so any help you can provide would be welcome.

Did you not find an issue with your data when doing the sentiment analyses for the vignette? If it is linked to emojis (which, as you say, might be a fruitful source of analyses), I'm surprised none occurred in 10000 tweets.

mkearney commented 8 years ago

It's definitely something with your settings. Mine seems to deal with emojis/Unicode just fine; I thought maybe you had run into a quirk in the sentiment analysis package.

For now I'd keep looking into stringi/stringr functions: https://www.r-bloggers.com/icu-unicode-text-transforms-in-the-r-package-stringi/. I'll try to look into this more when I can as well.
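
A quick way to compare the settings in question between two machines, as a sketch using base R only. The vector x is made-up example data; nothing here is specific to rtweet.

```r
# Character-type locale of the current session. On the Windows setup reported
# earlier in this thread it would name a code page such as English_Canada.1252.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether the session is multibyte, UTF-8 and/or Latin-1.
info <- l10n_info()

# Declared encoding of each string: "unknown", "UTF-8", "latin1" or "bytes".
x <- c("plain ascii", "caf\u00e9")
marks <- Encoding(x)

print(ctype)
print(info[["UTF-8"]])
print(marks)
```

If info[["UTF-8"]] is FALSE on one machine and TRUE on the other, that difference is a plausible suspect for why the same tweets behave differently.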

pssguy commented 8 years ago

I think I have hit something similar before - down to Windows and locale?

I saved dt$text to a file, test.csv.

This was one that appeared to be causing issues:

dt$text[18] #[1] "@quominus I'm pretty sure we've had this discussion but London! \xf0\u009f\u008e\u0080\xf0\u009f\u0092\u0098\xf0\u009f\u0092\u0096\xf0\u009f\u008c�\xf0\u009f\u008c�"

head(dt$text,3)

[1] "RT @T64Pamela: @quominus @MethadoneBaby Potus just gave a commandment to a private business to cease and desist, against a court ruling. Ma…"
[2] "RT @T64Pamela: @quominus @MethadoneBaby Obama overruled a judge's ruling. This is unlawful regardless of the agency. This is the act of a d…"
[3] "@quominus @SyracuseU or I guess just mindful* communication"    

eval(parse("test.csv", encoding="UTF-8"))

Error in parse("test.csv", encoding = "UTF-8") : 
  test.csv:3:16: unexpected '@'
2: "RT @T64Pamela: @quominus @MethadoneBaby Potus just gave a commandment to a private business to cease and desist, against a court ruling. Maâ€¦"
3: RT @T64Pamela: @
                  ^

So that line of code failed before I even reached the problem tweet.

mkearney commented 8 years ago

I now set the encoding explicitly to UTF-8 when parsing the JSON object. Hopefully that fixes the problem. Let me know either way!

pssguy commented 8 years ago

Doesn't look like it :( Upgraded to 0.2.6


dt$text[19]
[1] "Recalling standardized test experiences: During the PSAT I threw up the blueberry pancakes my mom made me for breakfast. \xf0\u009f\u0098\u0082 @quominus"

Encoding(dt$text[19])
[1] "UTF-8"  # I think before it was "unknown"

syuzhet::get_nrc_sentiment(dt$text)
Error in tolower(char_v) : 
  invalid input 'Recalling standardized test experiences: During the PSAT I threw up the blueberry pancakes my mom made me for breakfast. ðŸ˜‚ @quominus' in 'utf8towcs'

mkearney commented 8 years ago

Okay, I think I figured it out. I was able to replicate the error, and then was able to fix it. There's now a clean_tweets function that deals with the error.

# if this produces error
tolower(data$text)

# clean tweets
data$text <- clean_tweets(data$text)

# no error
tolower(data$text)

Until I understand the full implications of the conversion to ASCII in these situations, I've not made it the default (out of fear that it'll be stripping otherwise valuable information).
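
To make the trade-off concrete, here is a minimal sketch of an ASCII conversion using base R's iconv(). It is an illustration under assumptions, not necessarily what clean_tweets does internally; the name to_ascii and the sample strings are hypothetical.

```r
# Hypothetical stand-in for an ASCII-cleaning step; the real clean_tweets()
# may work differently.
to_ascii <- function(x) {
  # enc2utf8() makes sure the input bytes really are UTF-8 before converting;
  # sub = "" asks iconv() to drop bytes that have no ASCII equivalent.
  iconv(enc2utf8(x), from = "UTF-8", to = "ASCII", sub = "")
}

# Made-up examples: an emoji, an accented letter, and plain text.
tweets <- c("London! \U0001F380", "caf\u00e9 time", "plain text")
ascii <- to_ascii(tweets)

# After conversion only ASCII remains, so tolower() has nothing to choke on.
lowered <- tolower(ascii)
print(lowered)
```

The emoji and the accented letter do not survive the conversion, which is exactly the information loss that argues against making it the default.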

mkearney commented 8 years ago

https://github.com/mkearney/rtweet/commit/6b9a4aa55327cbf640458898df33926de3badf75

pssguy commented 8 years ago

Thanks. I had a brief look and it seems to solve the issue. For the one set I have looked at, the tolower() check was sufficient.

ropensci / rtweet

Problem with sentiment analysis #10