trinker / sentimentr

Dictionary based sentiment analysis that considers valence shifters

Sentimentr splits words when there is an accented character #118

Closed dominiqueemmanuel closed 2 years ago

dominiqueemmanuel commented 4 years ago

Hi,

Thanks for this high-quality package for calculating text polarity sentiment.

I think there is a problem with accented characters:

library(sentimentr)
key <- data.frame(
  words = c("problème","probleme"),
  polarity = c(-1,-1),
  stringsAsFactors = FALSE
)
mykey <- as_key(key)
sentences <- c("le problème", "le probleme")
sentiment_by(sentences, polarity_dt = mykey)
# element_id word_count sd ave_sentiment
# 1:          1          3 NA     0.0000000
# 2:          2          2 NA    -0.7071068

It works with probleme but not with problème. I think the issue comes from the make_sentence_df2 function, which splits words that contain accented characters:

sentimentr:::make_sentence_df2(sentences)
# id   sentences wc
# 1:  1 le probl me  3
# 2:  2 le probleme  2

More precisely, I think the problem comes from this line: https://github.com/trinker/sentimentr/blob/6d33a96a3ed758612065ea2666da638d056d6c19/R/utils.R#L168

I think text.var <- gsub("[^a-z',;: ]|\\d:\\d|\\d ", " ", may be replaced by text.var <- gsub("[^[:alpha:]',;: ]|\\d:\\d|\\d ", " ",
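A quick sketch of the difference (my own check, assuming a UTF-8 locale, where [:alpha:] also matches accented letters):

x <- "le problème"
gsub("[^a-z',;: ]|\\d:\\d|\\d ", " ", x)
## "le probl me"    the current a-z pattern strips the accented letter
gsub("[^[:alpha:]',;: ]|\\d:\\d|\\d ", " ", x)
## "le problème"    the [:alpha:] version keeps it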

What do you think about it?

Best regards, Dominique

trinker commented 2 years ago

Hello. As I discuss here (https://github.com/trinker/sentimentr/issues/74#issuecomment-361955888), sentimentr is English based. I don't have the expertise in other languages to understand the ramifications of extending it beyond its current state. Your solution may work. Let me look into this more.

trinker commented 2 years ago

x <- c(
    "danish characteøs  sentåment æcores words correctly 456",
    "It works with probleme but not with problème 234"
)
gsub("[^[:alpha:]',;: ]|\\d:\\d|\\d ",  '', x)

##  "danish characteøs  sentåment æcores words correctly " "It works with probleme but not with problème "  

This may work. I need to look at [:alpha:]
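One hedged way to check what [:alpha:] matches in the current locale (POSIX character classes in base R are locale dependent, so results can vary by platform and regex engine):

grepl("^[[:alpha:]]+$", "problème")
## TRUE in a UTF-8 locale; may differ under other locales or engines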

trinker commented 2 years ago

In some cases [:alpha:] may work. In others you may need to explicitly pass in \\p{L} instead, as I show in the demo below:

## Use With Non-ASCII
## Warning: sentimentr has not been tested with languages other than English.
## The example below is how one might use sentimentr if you believe the
## language you are working with is similar enough in grammar for
## sentimentr to be viable (likely Germanic languages).
## english_sents <- c(
##     "I hate bad people.",
##     "I like yummy cookie.",
##     "I don't love you anymore; sorry."
## )

## Roughly equivalent to the above English
danish_sents <- stringi::stri_unescape_unicode(c(
    "Jeg hader d\\u00e5rlige mennesker.",
    "Jeg kan godt lide l\\u00e6kker is.",
    "Jeg elsker dig ikke mere; undskyld."
))

danish_sents
## > danish_sents
## [1] "Jeg hader dårlige mennesker."        "Jeg kan godt lide lækker is."       
## [3] "Jeg elsker dig ikke mere; undskyld."

## Polarity terms
polterms <- stringi::stri_unescape_unicode(
    c('hader', 'd\\u00e5rlige', 'undskyld', 'l\\u00e6kker', 'kan godt', 'elsker')
)

## Make polarity_dt
danish_polarity <- as_key(data.frame(
    x = stringi::stri_unescape_unicode(polterms),
    y = c(-1, -1, -1, 1, 1, 1)
))

## Make valence_shifters_dt
danish_valence_shifters <- as_key(
    data.frame(x='ikke', y="1"),
    sentiment = FALSE,
    comparison = NULL
)

sentiment(
    danish_sents,
    polarity_dt = danish_polarity,
    valence_shifters_dt = danish_valence_shifters,
    retention_regex = "\\d:\\d|\\d\\s|[^\\p{L}',;: ]"
)

## A way to test if you need [:alpha:] vs \\p{L}
## Does it wreck some of the non-ascii characters by default?
sentimentr:::make_sentence_df2(danish_sents)

## > sentimentr:::make_sentence_df2(danish_sents)
##    id                            sentences wc
## 1:  1         jeg hader d rlige mennesker   5
## 2:  2         jeg kan godt lide l kker is   7
## 3:  3 jeg elsker dig ikke mere ; undskyld   6

## Does this?
sentimentr:::make_sentence_df2(danish_sents, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")

## > sentimentr:::make_sentence_df2(danish_sents, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")
##    id                            sentences wc
## 1:  1         jeg hader dårlige mennesker   4
## 2:  2         jeg kan godt lide lækker is   6
## 3:  3 jeg elsker dig ikke mere ; undskyld   6

## If you answer yes to the first question but no to the second, you likely want \\p{L}

dominiqueemmanuel commented 2 years ago

Thanks for working on this!

On my side, both [:alpha:] and \\p{L} seem OK:

txt <- c("première","dårlige","lækker")
stringi::stri_replace_all_regex(txt ,'[^a-zA-Z;:,\']', " ")==txt
#c(FALSE,FALSE,FALSE)
## >> KO  :(
stringi::stri_replace_all_regex(txt ,"[^[:alpha:];:,\']", " ")==txt
# c(TRUE,TRUE,TRUE)
## >> OK  :)
stringi::stri_replace_all_regex(txt ,"[^\\p{L};:,\']", " ")==txt
# c(TRUE,TRUE,TRUE)

However, I would like to draw your attention to the fact that "[^\\p{L};:,\']" is compatible with stringi::stri_replace_all_regex but not with gsub, so I think you should modify this line: https://github.com/trinker/sentimentr/blob/d96673a423f21683ed969d34e78608ab2e575e9e/R/utils.R#L168
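A minimal sketch of that incompatibility, assuming the default gsub engine (TRE) does not understand \\p{L} while stringi's ICU engine (and PCRE via perl = TRUE) does:

txt <- "première"
stringi::stri_replace_all_regex(txt, "[^\\p{L};:,']", " ")   # "première" (ICU supports \\p{L})
## gsub("[^\\p{L};:,']", " ", txt)                            # reported above not to work with the default engine
gsub("[^\\p{L};:,']", " ", txt, perl = TRUE)                  # "première" (PCRE supports \\p{L})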

Kind regards, Dom

trinker commented 2 years ago

Thanks, I have changed this out!