Closed dominiqueemmanuel closed 2 years ago
Hello. As I discuss here (https://github.com/trinker/sentimentr/issues/74#issuecomment-361955888), sentimentr is English based. I don't have the expertise in other languages to understand the ramifications of extending it beyond its current state. Your solution may work. Let me look into this more.
x <- c(
"danish characteøs sentåment æcores words correctly 456",
"It works with probleme but not with problème 234"
)
gsub("[^[:alpha:]',;: ]|\\d:\\d|\\d ", '', x)
## "danish characteøs sentåment æcores words correctly " "It works with probleme but not with problème "
This may work. I need to look at [:alpha:]
In some cases [:alpha:] may work. In others you may need to explicitly pass in \\p{L} instead, as I show in the demo below:
## Use With Non-ASCII
## Warning: sentimentr has not been tested with languages other than English.
## The example below is how one might use sentimentr if you believe the
## language you are working with is similar enough in grammar for
## sentimentr to be viable (likely Germanic languages)
library(sentimentr)
## english_sents <- c(
## "I hate bad people.",
## "I like yummy cookie.",
## "I don't love you anymore; sorry."
## )
## Roughly equivalent to the above English
danish_sents <- stringi::stri_unescape_unicode(c(
"Jeg hader d\\u00e5rlige mennesker.",
"Jeg kan godt lide l\\u00e6kker is.",
"Jeg elsker dig ikke mere; undskyld."
))
danish_sents
## > danish_sents
## [1] "Jeg hader dårlige mennesker." "Jeg kan godt lide lækker is."
## [3] "Jeg elsker dig ikke mere; undskyld."
## Polarity terms
polterms <- stringi::stri_unescape_unicode(
c('hader', 'd\\u00e5rlige', 'undskyld', 'l\\u00e6kker', 'kan godt', 'elsker')
)
## Make polarity_dt
danish_polarity <- as_key(data.frame(
x = stringi::stri_unescape_unicode(polterms),
y = c(-1, -1, -1, 1, 1, 1)
))
## Make valence_shifters_dt
danish_valence_shifters <- as_key(
data.frame(x='ikke', y="1"),
sentiment = FALSE,
comparison = NULL
)
sentiment(
danish_sents,
polarity_dt = danish_polarity,
valence_shifters_dt = danish_valence_shifters,
retention_regex = "\\d:\\d|\\d\\s|[^\\p{L}',;: ]"
)
## A way to test if you need [:alpha:] vs \\p{L}
## Does it wreck some of the non-ascii characters by default?
sentimentr:::make_sentence_df2(danish_sents)
## > sentimentr:::make_sentence_df2(danish_sents)
## id sentences wc
## 1: 1 jeg hader d rlige mennesker 5
## 2: 2 jeg kan godt lide l kker is 7
## 3: 3 jeg elsker dig ikke mere ; undskyld 6
## Does this?
sentimentr:::make_sentence_df2(danish_sents, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")
## > sentimentr:::make_sentence_df2(danish_sents, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")
## id sentences wc
## 1: 1 jeg hader dårlige mennesker 4
## 2: 2 jeg kan godt lide lækker is 6
## 3: 3 jeg elsker dig ikke mere ; undskyld 6
## If you answer yes to #1 but no to #2 you likely want \\p{L}
Thanks for working on this!
On my side, both [:alpha:] and \\p{L} seem OK:
txt <- c("première","dårlige","lækker")
stringi::stri_replace_all_regex(txt ,'[^a-zA-Z;:,\']', " ")==txt
#c(FALSE,FALSE,FALSE)
## >> KO :(
stringi::stri_replace_all_regex(txt ,"[^[:alpha:];:,\']", " ")==txt
# c(TRUE,TRUE,TRUE)
## >> OK :)
stringi::stri_replace_all_regex(txt ,"[^\\p{L};:,\']", " ")==txt
# c(TRUE,TRUE,TRUE)
However, I would like to draw your attention to the fact that "[^\\p{L};:,\']" is compatible with stringi::stri_replace_all_regex but not with gsub (at least with its default settings), so I think you should modify this line:
https://github.com/trinker/sentimentr/blob/d96673a423f21683ed969d34e78608ab2e575e9e/R/utils.R#L168
Kind regards, Dom
Thanks, I have changed this out!
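A note on the gsub point above: base R's default regex engine does not understand \\p{L}, but passing perl = TRUE switches gsub to PCRE, which does. The sketch below is an illustration only, not the change that was made in sentimentr. It assumes the input strings are UTF-8 and reuses the three test words from the comparison above; the variable names are made up for the example.

```r
# Test words written with \u escapes so they are UTF-8 regardless of file encoding
txt <- c("premi\u00e8re", "d\u00e5rlige", "l\u00e6kker")

# Same character class as in the comparison above
pattern <- "[^\\p{L};:,']"

# perl = TRUE makes gsub use PCRE, which supports Unicode properties such as \p{L}
base_out <- gsub(pattern, " ", txt, perl = TRUE)

# TRUE for a word means no character in it was replaced by a space
print(base_out == txt)

# Optional cross-check against stringi (ICU regex), only if it is installed
if (requireNamespace("stringi", quietly = TRUE)) {
  stringi_out <- stringi::stri_replace_all_regex(txt, pattern, " ")
  print(identical(base_out, stringi_out))
}
```

If the first print shows FALSE for any word on your machine, the strings are probably not being treated as UTF-8, and the stringi route is likely the safer one.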
Hi,
Thanks for this high-quality package for calculating text polarity sentiment.
I think there is a problem with accented characters:
It works with probleme but not with problème. I think it comes from the make_sentence_df2 function, which splits words at accented characters.
More precisely, I think the problem comes from here: https://github.com/trinker/sentimentr/blob/6d33a96a3ed758612065ea2666da638d056d6c19/R/utils.R#L168
I think
text.var <- gsub("[^a-z',;: ]|\\d:\\d|\\d ", " ",
may be replaced by
text.var <- gsub("[^[:alpha:]',;: ]|\\d:\\d|\\d ", " ",
What do you think about it?
Best regards, Dominique
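For anyone who wants to reproduce the report without digging into sentimentr internals, here is a minimal sketch in base R. It applies the two patterns quoted above to one sentence and splits the result into words. The helper split_words and the variable names are made up for illustration, and the result for [:alpha:] can depend on your locale and encoding.

```r
# Sample sentence; the accented word is written with a \u escape so it is UTF-8
x <- "it works with probleme but not with probl\u00e8me 234"

# The pattern currently used, and the proposed replacement
old_pattern <- "[^a-z',;: ]|\\d:\\d|\\d "
new_pattern <- "[^[:alpha:]',;: ]|\\d:\\d|\\d "

# Hypothetical helper: replace unwanted characters with spaces, then split on spaces
split_words <- function(text, pattern) {
  cleaned <- gsub(pattern, " ", text)
  words <- strsplit(cleaned, " +")[[1]]
  words[nzchar(words)]
}

old_words <- split_words(x, old_pattern)
new_words <- split_words(x, new_pattern)

# Compare the two word lists: a word split in two shows up as extra entries
print(old_words)
print(new_words)
```

If the accented word appears as two fragments in the first list but as one word in the second, the character class is the cause of the split.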