Rauh dictionary wrong negated forms

Astelix commented 5 years ago

In the (German) "data_dictionary_Rauh", the negated forms should be "nicht ..." instead of "not ...". For nominalized forms ending in "...ung", it should be "keine".

kbenoit commented 5 years ago

Thanks @Astelix! @stefan-mueller want to verify and fix?

stefan-mueller commented 5 years ago

Thanks! I am aware of this, but the original dictionary indicates negations through "not" in the categories neg_negative and neg_positive. Thus, changing the forms to "nicht" or "keine" would also imply changing the entries in the original dictionary. Otherwise, negations will not be detected. I am not sure whether we should touch the dictionary entries. What do you think?

library(quanteda.dictionaries)

head(data_dictionary_Rauh$neg_positive, 15)
#>  [1] "not aalen"             "not abbauwürdig"      
#>  [3] "not abfangschirm"      "not abgefahren"       
#>  [5] "not abgeheilt"         "not abgehend"         
#>  [7] "not abgeklärtheit"     "not abgelagert"       
#>  [9] "not abgemacht"         "not abgeschlossenheit"
#> [11] "not abgesichert"       "not abgestimmt"       
#> [13] "not abgeworben"        "not abgleich"         
#> [15] "not abgleichen"

Astelix commented 5 years ago

From the original dictionary:

                                       pattern      replacement          feature

kbenoit commented 5 years ago

It would work as in the original dictionary if it were structured as a regular expression dictionary. Unlike glob patterns, regular expressions would let us prefix each positive word with the possible negations.

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
dicttest <- dictionary(list(neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$")))

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

tokens(txt) %>%
  tokens_lookup(dictionary = dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas"        "NEG_POSITIVE" "und"          "etwas"       
## [5] "NEG_POSITIVE"
## 
## text2 :
## [1] "etwas"        "abarbeiten"   "und"          "NEG_POSITIVE"

We don't currently have a valuetype set in the dictionary object class, but we do have an open issue for it (#1264). This would be a good argument for adding that attribute, so that the lookup functions use it as the default rather than "glob". That would let us make sure that every dictionary is associated with the correct pattern matching type (valuetype).
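
For readers who want to see how the existing "not ..." entries could be turned into regex patterns of this shape, here is a minimal base R sketch. It is only an illustration, not the package's implementation: the three entries are copied from the head() output earlier in the thread, the negation list is the one from the regex above, and the variable names are made up for this example.

```r
# A few existing entries, copied from the head() output earlier in the thread
entries <- c("not aalen", "not abgefahren", "not abgesichert")

# Negation words from the regex example above (not an exhaustive list)
negations <- c("nicht", "nichts", "kein", "keine", "keinen")

# First token: any one of the negation words, anchored to the whole token
negation_regex <- paste0("^(", paste(negations, collapse = "|"), ")$")

# Strip the English "not " marker and anchor the remaining term as the second token
terms <- sub("^not ", "", entries)
neg_positive_patterns <- paste0(negation_regex, " ^", terms, "$")

neg_positive_patterns[1]
#> [1] "^(nicht|nichts|kein|keine|keinen)$ ^aalen$"
```

The resulting character vector could then be passed to dictionary(list(neg_positive = neg_positive_patterns)) and looked up with valuetype = "regex", as in the example above. Terms containing regex metacharacters would need escaping first, which this sketch does not handle.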

stefan-mueller commented 5 years ago

That would be a very elegant solution. I just asked Christian Rauh what he thinks about this idea.

ChRauh commented 5 years ago

Great to see interest in the dictionary and thanks again for including it into your fantastic package!

On the issue: The dictionary is structured to match valuetype = "regex". Thus (and also more generally), I'd consider a valuetype attribute on the dictionary object class very convenient from the user's perspective.

Note, however, that I would still suggest first replacing the negation patterns in the original text with a compound marker such as "NOT_[token]" (maybe via tokens_replace()) before retrieving the dictionary counts via tokens_lookup() or dfm(). This makes a difference when aggregating the counts to some sentiment score.

For example, directly counting dictionary terms in the string 'nicht abarbeiten' would retrieve one negative and one negated negative hit. Yet having this replaced with 'NOT_abarbeiten' beforehand would retrieve only the negated negative hit.

Hope this helps...
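
The replacement step described above can be sketched in base R on the raw strings. This is an assumption-laden illustration rather than the procedure used with the original dictionary: it works on untokenized text with gsub() instead of tokens_replace(), it is case-sensitive, and the negation list is simply the one from the regex example earlier in the thread.

```r
txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

# Negation words taken from the regex example earlier in the thread
negations <- c("nicht", "nichts", "kein", "keine", "keinen")

# Match a whole negation word, the whitespace after it, and the following token
negation_pattern <- paste0("\\b(", paste(negations, collapse = "|"), ")\\s+(\\S+)")

# Drop the negation word and prefix the following token with "NOT_"
txt_marked <- gsub(negation_pattern, "NOT_\\2", txt, perl = TRUE)

txt_marked
#> [1] "etwas NOT_abarbeiten und etwas NOT_abarbeiten"
#> [2] "etwas abarbeiten und NOT_abarbeiten"
```

After this step each negated term is a single token, so a lookup can no longer count the bare term and its negated form for the same occurrence.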

kbenoit commented 5 years ago

Thanks @ChRauh that's a good point. Could be done in two stages:

dicttest <-
  dictionary(list(
    neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$"),
    positive = "^abarbeiten$"
  ))

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

dfm(txt, dictionary = dicttest, valuetype = "regex")
## Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
## 2 x 2 sparse Matrix of class "dfm"
##        features
## docs    neg_positive positive
##   text1            2        2
##   text2            1        2

tokens(txt) %>%
  tokens_lookup(dicttest["neg_positive"], valuetype = "regex", exclusive = FALSE) %>%
  tokens_lookup(dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas"        "NEG_POSITIVE" "und"          "etwas"       
## [5] "NEG_POSITIVE"
## 
## text2 :
## [1] "etwas"        "POSITIVE"     "und"          "NEG_POSITIVE"
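
To make the difference between the two outputs concrete, here is a small base R sketch of the arithmetic. It assumes that every neg_positive hit in the dfm() output was also counted once as positive, which holds for this toy example but is not guaranteed in general; the counts are typed in by hand from the dfm() output above.

```r
# Counts copied from the dfm() output above (filled column by column)
counts <- matrix(
  c(2, 1,   # neg_positive for text1, text2
    2, 2),  # positive for text1, text2
  nrow = 2,
  dimnames = list(docs = c("text1", "text2"),
                  features = c("neg_positive", "positive"))
)

# Remove the positive hits that were really part of a negated phrase
positive_only <- counts[, "positive"] - counts[, "neg_positive"]

positive_only
#> text1 text2 
#>     0     1
```

These corrected counts agree with the two-stage tokens_lookup() result, where text1 contains no POSITIVE token and text2 contains one.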

ChRauh commented 5 years ago

@kbenoit Yes, 'piping' it in that order does the trick. Learned something, thanks! Maybe also a useful example for the help file, in which @stefan-mueller has already flagged the replacement issue.

kbenoit / quanteda.dictionaries

Rauh dictionary wrong negated forms #24