Open Astelix opened 5 years ago
Thanks @Astelix! @stefan-mueller want to verify and fix?
Thanks! I am aware of this, but the original dictionary indicates negations through "not" in the categories neg_negative and neg_positive. Thus, changing the forms to "nicht" or "keine" would also imply changing the entries in the original dictionary. Otherwise, negations will not be detected. I am not sure whether we should touch the dictionary entries. What do you think?
library(quanteda.dictionaries)
head(data_dictionary_Rauh$neg_positive, 15)
#> [1] "not aalen" "not abbauwürdig"
#> [3] "not abfangschirm" "not abgefahren"
#> [5] "not abgeheilt" "not abgehend"
#> [7] "not abgeklärtheit" "not abgelagert"
#> [9] "not abgemacht" "not abgeschlossenheit"
#> [11] "not abgesichert" "not abgestimmt"
#> [13] "not abgeworben" "not abgleich"
#> [15] "not abgleichen"
From the original dictionary:
      pattern                                     replacement    feature        sentiment
 1999 (nicht|nichts|kein|keine|keinen) aalen      NOT_aalen      NOT_aalen      -1
21164 (nicht|nichts|kein|keine|keinen) aalglatt   NOT_aalglatt   NOT_aalglatt    1
21165 (nicht|nichts|kein|keine|keinen) aasen      NOT_aasen      NOT_aasen       1
21166 (nicht|nichts|kein|keine|keinen) aasig      NOT_aasig      NOT_aasig       1
17540 (nicht|nichts|kein|keine|keinen) abandon    NOT_abandon    NOT_abandon     1
21167 (nicht|nichts|kein|keine|keinen) abarbeiten NOT_abarbeiten NOT_abarbeiten  1
It would work as in the original dictionary if it were structured as a regular-expression dictionary. Unlike glob patterns, regexes would let us prefix each positive word with all the possible negation forms.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
dicttest <- dictionary(list(neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$")))
txt <- c(
"etwas nicht abarbeiten und etwas keine abarbeiten",
"etwas abarbeiten und keinen abarbeiten"
)
tokens(txt) %>%
tokens_lookup(dictionary = dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas" "NEG_POSITIVE" "und" "etwas"
## [5] "NEG_POSITIVE"
##
## text2 :
## [1] "etwas" "abarbeiten" "und" "NEG_POSITIVE"
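To apply the same idea to every entry rather than a single word, the existing "not ..." entries could be rewritten mechanically. Here is a minimal base-R sketch, assuming each entry starts with the literal prefix "not " followed by one word that contains no regex metacharacters. The helper name `negation_to_regex` is hypothetical and not part of quanteda.

```r
# Negation forms taken from the patterns in the original dictionary.
negation_forms <- c("nicht", "nichts", "kein", "keine", "keinen")

# Hypothetical helper: turn "not <word>" into a two-token regex pattern
# of the form "^(nicht|...)$ ^<word>$", one anchored regex per token.
negation_to_regex <- function(entries, forms = negation_forms) {
  prefix <- paste0("^(", paste(forms, collapse = "|"), ")$")
  words <- sub("^not ", "", entries)
  paste0(prefix, " ^", words, "$")
}

negation_to_regex(c("not aalen", "not abgemacht"))
```

The resulting character vector could then be passed to `dictionary()` in place of the single hand-written pattern above.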
We don't currently have a valuetype set in the dictionary object class, but we do have an open issue for it (#1264). This would be a good argument for adding that attribute, so that the lookup functions use it as the default rather than "glob". That would let us make sure that every dictionary is associated with the correct pattern matching type (valuetype).
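To make the idea concrete, here is a rough sketch of how such an attribute might behave. It uses a plain list rather than the real dictionary class, and both `make_dict` and `default_valuetype` are hypothetical names for illustration only, not the design proposed in #1264.

```r
# Hypothetical constructor: store the matching type as an attribute
# on a plain named list of patterns.
make_dict <- function(entries, valuetype = c("glob", "regex", "fixed")) {
  valuetype <- match.arg(valuetype)
  structure(entries, valuetype = valuetype)
}

# A lookup function could read its default from the dictionary and
# fall back to "glob" when the attribute is absent.
default_valuetype <- function(dict) {
  vt <- attr(dict, "valuetype", exact = TRUE)
  if (is.null(vt)) "glob" else vt
}

rauh_like <- make_dict(
  list(neg_positive = "^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$"),
  valuetype = "regex"
)

default_valuetype(rauh_like)        # "regex"
default_valuetype(list(a = "b*"))   # "glob"
```

With something along these lines, users of a regex-based dictionary would not need to remember to pass valuetype = "regex" on every lookup.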
That would be a very elegant solution. I just asked Christian Rauh what he thinks about this idea.
Great to see interest in the dictionary, and thanks again for including it in your fantastic package!
On the issue: The dictionary is structured such that it matches valuetype = "regex". Thus (and also more generally), I'd consider adding a valuetype attribute to the dictionary object class very convenient from the user perspective.
Note, however, that I would still suggest first replacing the negation patterns in the original text with a compound marker such as "NOT_[token]" (maybe via tokens_replace()) before retrieving the dictionary counts via tokens_lookup() or dfm(). This makes a difference when aggregating the counts to some sentiment score.
For example, directly counting dictionary terms in the string 'nicht abarbeiten' would retrieve one negative and one negated negative hit. Yet having this replaced with 'NOT_abarbeiten' beforehand would retrieve only the negated negative hit.
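The difference can be illustrated with a small base-R sketch. It uses plain regular expressions as a stand-in for tokens_replace() and tokens_lookup(), and the helper `count_matches` is a hypothetical name, so treat it as an illustration of the counting logic only.

```r
negations <- "(nicht|nichts|kein|keine|keinen)"
txt <- "etwas nicht abarbeiten"

# Hypothetical helper: count regex matches in a single string.
count_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  sum(m > 0)  # gregexpr() returns -1 when there is no match
}

# Direct counting: the bare word and the negated phrase both match.
direct <- c(
  negative     = count_matches("\\babarbeiten\\b", txt),
  neg_negative = count_matches(paste0("\\b", negations, " abarbeiten\\b"), txt)
)

# Replace first: fuse the negation and the following word into one token.
replaced <- gsub(paste0("\\b", negations, " +([^ ]+)"), "NOT_\\2", txt, perl = TRUE)

# "_" is a word character, so "\\babarbeiten" no longer matches
# inside "NOT_abarbeiten".
after <- c(
  negative     = count_matches("\\babarbeiten\\b", replaced),
  neg_negative = count_matches("\\bNOT_abarbeiten\\b", replaced)
)

direct  # negative = 1, neg_negative = 1
after   # negative = 0, neg_negative = 1
```

A score such as (positive hits - negative hits) would therefore treat the phrase as neutral under direct counting, whereas the replace-first approach counts only the negated term.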
Hope this helps...
Thanks @ChRauh that's a good point. Could be done in two stages:
dicttest <-
dictionary(list(
neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$"),
positive = "^abarbeiten$"
))
txt <- c(
"etwas nicht abarbeiten und etwas keine abarbeiten",
"etwas abarbeiten und keinen abarbeiten"
)
dfm(txt, dictionary = dicttest, valuetype = "regex")
## Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
## 2 x 2 sparse Matrix of class "dfm"
## features
## docs neg_positive positive
## text1 2 2
## text2 1 2
tokens(txt) %>%
tokens_lookup(dicttest["neg_positive"], valuetype = "regex", exclusive = FALSE) %>%
tokens_lookup(dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas" "NEG_POSITIVE" "und" "etwas"
## [5] "NEG_POSITIVE"
##
## text2 :
## [1] "etwas" "POSITIVE" "und" "NEG_POSITIVE"
@kbenoit Yes, 'piping' it in that order does the trick. Learned something, thanks! This might also be a useful example for the help file, in which @stefan-mueller has already flagged the replacement issue.
In the (German) "data_dictionary_Rauh", the negated forms should be "nicht ..." instead of "not ...". For nominalized forms ending in "...ung", it should be "keine".