mjockers / syuzhet

An R package for the extraction of sentiment and sentiment-based plot arcs from text

German language delivers wrong polarity #41

Open stefannisch opened 1 year ago

stefannisch commented 1 year ago

Analysing German text delivers wrong polarity

my_example_text <- "Frau am Bahnhof geschlagen und ausgeraubt Täter sind 7 Araber auch andere Fahrgäste sind wohl Opfer geworden"
get_nrc_sentiment(my_example_text, language = "german")

This German text is straightforward and should yield negative emotions, but I get "0" everywhere.

With the following call, however, I get a result that makes sense, namely "-3":

get_sentiment(my_example_text, method = "nrc", language = "german")

Can you help to fix it?

el-grudge commented 1 year ago

I think you might need to apply stemming/lemmatization (i.e. convert the words to their root form, as they would appear in a dictionary) to the text.

I was able to reproduce your output. Even when the text is translated to English, the output remains the same. However, when I used lemmatization, the output was much closer to what you'd expect:

my_example_translated <- "Woman beaten and robbed at the train station Perpetrators are 7 Arabs and other passengers have probably become victims"
get_nrc_sentiment(my_example_translated)
# output
# anger anticipation disgust fear joy sadness surprise trust negative positive
#     0            0       0    0   0       0        0     0        0        0

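# lemmatize the text using cleanNLP's cnlp_annotate function
# (assumes a cleanNLP backend has already been initialized, e.g. with cleanNLP::cnlp_init_udpipe())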
my_example_annotated <- cleanNLP::cnlp_annotate(
  my_example_translated,
  backend = NULL,
  verbose = 10,
  text_name = "text",
  doc_name = "doc_id"
)

# create a sentence from the list of tokenized lemmas. cnlp_annotate returns a tibble 'token'
# that contains information about each token, such as its lemma, part of speech (POS), etc.
# we only need the lemmas though, which can be accessed at $token$lemma
my_example_lemmatized <- paste(unlist(my_example_annotated$token$lemma), collapse=' ')

# now try get_nrc_sentiment on the lemmatized sentence
get_nrc_sentiment(my_example_lemmatized)
# output:
# anger anticipation disgust fear joy sadness surprise trust negative positive
#     3            1       2    3   0       3        0     0        3        0

I am not sure whether you can use cnlp_annotate on German, but maybe there's some other function that can achieve the same thing on German text. Hope this helps!
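To illustrate why lemmatization can change the result, here is a minimal base-R sketch. The three-word lexicon and the hand-written lemma table are both made up for illustration (they are not the NRC data and not cleanNLP's output); the point is only that inflected tokens miss a lexicon keyed by dictionary forms, while their lemmas hit it.

```r
# Toy lexicon keyed by dictionary (lemma) forms; hypothetical, NOT the NRC lexicon.
toy_lexicon <- c("beat", "rob", "victim")

# Hand-written lemma table standing in for a real lemmatizer (assumption).
toy_lemmas <- c(beaten = "beat", robbed = "rob", victims = "victim")

tokens <- c("woman", "beaten", "and", "robbed", "victims")

# Replace each token by its lemma where the table has one; keep it otherwise.
lemmatized <- unname(ifelse(tokens %in% names(toy_lemmas), toy_lemmas[tokens], tokens))

hits_raw        <- sum(tokens %in% toy_lexicon)      # 0: inflected forms are not in the lexicon
hits_lemmatized <- sum(lemmatized %in% toy_lexicon)  # 3: "beat", "rob", "victim"
```

A real lemmatizer would replace the lookup table; the matching step stays the same.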

stefannisch commented 1 year ago

@el-grudge I would agree with you, but somehow the dictionary doesn't seem to be applied as expected. I used this code to get the German dictionary out of the syuzhet package:

dictionary <- get_sentiment_dictionary(dictionary = "nrc", language = "german")

When I looked up the words of my text in it, I got the following results:

If it were only the word "Täter", that would point to a problem with special characters. But the other words don't contain any special characters. My impression is that the dictionary somehow isn't being used properly.

Maybe it also has something to do with the warning R gives me when I run the code:

Warning message:
`spread_()` was deprecated in tidyr 1.2.0.
ℹ Please use `spread()` instead.
ℹ The deprecated feature was likely used in the syuzhet package.
  Please report the issue to the authors.

el-grudge commented 1 year ago

@stefannisch, actually there is a problem with the special character 'ä'. Luckily, there's a quick - although imperfect - fix.

There are 2 issues here:

1. The German NRC lexicon is not all lowercase, unlike the English version (although, for some weird reason, the word 'true' appears only in all caps there!). Anyway, you can easily deal with this by setting the lowercase parameter to FALSE:

my_example_text <- "Frau am Bahnhof geschlagen und ausgeraubt Täter sind 7 Araber auch andere Fahrgäste sind wohl Opfer geworden"
syuzhet::get_nrc_sentiment(my_example_text, language = "german", lowercase = FALSE)

# output
# anger anticipation disgust fear joy sadness surprise trust negative positive
#     2            0       1    3   0       3        0     1        3        0

This should fix your problem. However, these scores are not 100% accurate, which leads me to the second point.

2. The letter 'ä' is not matched by the regex passed to the strsplit call within the get_nrc_sentiment function. Here's the original call from the source code:

word_l <- strsplit(char_v, "[^A-Za-z']+")

As a result, the word 'täter' gets split into 't' and 'ter', neither of which is found in the German lexicon. If the letter were correctly recognized, you would get the following scores:

# output when fixing the regex to include the letter 'ä'
# anger anticipation disgust fear joy sadness surprise trust negative positive
#     3            0       2    4   0       4        0     1        5        0

Notice how the values increase. I'll fork the code and attempt a solution, but it might take some time.
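Both issues can be reproduced in isolation with base R. The sketch below is only an illustration under stated assumptions: count_hits() is a hypothetical helper (not part of syuzhet), the four-word lexicon is a toy stand-in for the German NRC lexicon, and the Unicode-aware pattern "[^\\p{L}']+" is one possible replacement for the split regex, not necessarily the fix the package will adopt. It uses perl = TRUE throughout so that \p{L} is available.

```r
# Tokenize `text` and count tokens found in `lexicon` (exact, case-sensitive match).
# count_hits() is a hypothetical helper, not part of syuzhet.
count_hits <- function(text, lexicon, split_regex, lowercase) {
  if (lowercase) text <- tolower(text)
  tokens <- strsplit(text, split_regex, perl = TRUE)[[1]]
  sum(tokens %in% lexicon)
}

# \u00e4 is the letter 'ä'; escapes keep the example independent of file encoding.
text <- "Frau am Bahnhof geschlagen und ausgeraubt T\u00e4ter sind 7 Araber auch andere Fahrg\u00e4ste sind wohl Opfer geworden"

# Toy lexicon with capitalized nouns; hypothetical, NOT the German NRC lexicon.
toy_lexicon <- c("geschlagen", "ausgeraubt", "T\u00e4ter", "Opfer")

ascii_split   <- "[^A-Za-z']+"   # pattern quoted from the source code above
unicode_split <- "[^\\p{L}']+"   # assumed alternative: any Unicode letter is part of a word

split_ascii   <- strsplit("T\u00e4ter", ascii_split, perl = TRUE)[[1]]    # "T" "ter"
split_unicode <- strsplit("T\u00e4ter", unicode_split, perl = TRUE)[[1]]  # "Täter"

hits_default   <- count_hits(text, toy_lexicon, ascii_split,   lowercase = TRUE)   # 2
hits_keep_case <- count_hits(text, toy_lexicon, ascii_split,   lowercase = FALSE)  # 3
hits_unicode   <- count_hits(text, toy_lexicon, unicode_split, lowercase = FALSE)  # 4
```

With the toy data, keeping the original case recovers "Opfer", and the Unicode-aware split additionally recovers "Täter", which mirrors the step-by-step increase in the scores above.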

stefannisch commented 1 year ago

Thank you @el-grudge, that already helped a lot. I guess then that the letters "ä", "ö" and "ü" are quite troublesome...

That sounds awesome, thank you very much for working on it.

stefannisch commented 1 year ago

Oh, I didn't mean to close it.

mjockers commented 1 year ago

Hi Stefan, since leaving academia, I rarely find time to work on this package anymore. Support for non-English languages is weak. I encourage you to fork the repo and develop a solution.