trinker / sentimentr

Dictionary based sentiment analysis that considers valence shifters

Any hint why the two emoji approaches give different results, and in which circumstances each one is better? #115

jguo1002 closed 3 years ago

jguo1002 commented 4 years ago

In the doc there are two approaches to deal with emojis:

## Emojis
## Not run:
library(sentimentr)  # provides sentiment() and update_polarity_table()
## Load R twitter data
x <- read.delim(
    system.file("docs/r_tweets.txt", package = "textclean"),
    stringsAsFactors = FALSE
)
x
library(dplyr); library(magrittr)
## There are 2 approaches

## Approach 1: Replace with words
x %>%
    mutate(Tweet = replace_emoji(Tweet)) %$%
    sentiment(Tweet)

## Approach 2: Replace with identifier token
combined_emoji <- update_polarity_table(
    lexicon::hash_sentiment_jockers_rinker,
    x = lexicon::hash_sentiment_emojis
)
x %>%
    mutate(Tweet = replace_emoji_identifier(Tweet)) %$%
    sentiment(Tweet, polarity_dt = combined_emoji)
## End(Not run)

The results are different. [image: emoji approaches]

Is there any hint as to why the results differ, and in which circumstances each approach is better? Thanks!

trinker commented 3 years ago

I would not expect the results to be the same. One approach changes the underlying text (and thus the word counts), while the other uses the emojis as matches. So the difference is in the word counts, which are used in computing the polarity.

Additionally, in the second example you give, you replace the words as well as use the emojis in the sentiment dictionary.
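
To make the word-count point concrete, here is a minimal base-R sketch. It is not sentimentr's implementation: `toy_score()` is a hypothetical helper that only sums the polarity of matched tokens and divides by the square root of the word count, ignoring valence shifters entirely. The replacement phrase, the identifier token `emojitoken1`, and all polarity values are made up for illustration.

```r
# Toy scorer (an assumption, NOT sentimentr's algorithm): sum the polarity of
# tokens found in the lexicon, then divide by sqrt(number of tokens).
toy_score <- function(text, lexicon) {
    tokens <- strsplit(tolower(text), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    hits <- lexicon[tokens]  # NA for tokens missing from the lexicon
    sum(hits, na.rm = TRUE) / sqrt(length(tokens))
}

# Approach 1 (hypothetical): the emoji was replaced by a multi-word phrase,
# so the sentence grows from 3 tokens to 7.
text_words <- "great game smiling face with heart eyes"
word_lexicon <- c(great = 0.5, smiling = 0.5)

# Approach 2 (hypothetical): the emoji was replaced by a single identifier
# token that has its own entry in the combined lexicon.
text_ident <- "great game emojitoken1"
ident_lexicon <- c(great = 0.5, emojitoken1 = 0.5)

score_words <- toy_score(text_words, word_lexicon)   # 1 / sqrt(7)
score_ident <- toy_score(text_ident, ident_lexicon)  # 1 / sqrt(3)

print(c(words = score_words, identifier = score_ident))
```

Both toy sentences carry the same summed polarity, yet the scores differ because the denominators differ. With real data the gap can go either way, since the two approaches also match different dictionary entries.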