Open BobMuenchen opened 7 years ago
There is no function designed specifically for negations, but tokens_ngrams and tokens_compound can be used. While tokens_ngrams generates ngrams from any adjacent words, tokens_compound joins only the words you specify. For example, tokens_compound(toks, 'not *') will join negations and the words that follow them to generate tokens like 'not_good' or 'not_nice'.
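A minimal sketch of the difference between the two functions, assuming a current quanteda release in which a multi-word pattern is wrapped in phrase() (the pattern is written above as a plain string, so treat the phrase() call as an assumption about the installed version):

```r
library(quanteda)

toks <- tokens("The man is not bad, actually pretty good")

# Join each "not" with the token that follows it, e.g. "not_bad".
# phrase() marks the pattern as a sequence of tokens rather than one token.
toks_neg <- tokens_compound(toks, pattern = phrase("not *"))
print(toks_neg)

# By contrast, tokens_ngrams() builds a bigram from every adjacent pair,
# whether or not a negation is involved.
toks_bi <- tokens_ngrams(toks, n = 2)
print(toks_bi)
```

The compounded tokens can then be matched by dictionary entries written with the same concatenator (for example 'not_bad').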
If you want more generalized expressions of negations, use tokens_lookup.
library(quanteda)
positive <- c('good', 'nice')
negative <- c('bad', 'nasty')
# dictionary() takes a named list in current quanteda releases
dict <- dictionary(list(pos = c(positive, paste('not', negative)), neg = c(negative, paste('not', positive))))
toks <- tokens("The man is not bad, actually pretty good")
tokens_lookup(toks, dict, exclusive = FALSE)
## tokens from 1 document.
## Component 1 :
## [1] "The" "man" "is" "POS" "," "actually" "pretty" "POS"
However, this method does not work when there are words between negations and sentiment words:
toks2 <- tokens("The man is good, and not going to be bad")
tokens_lookup(toks2, dict, exclusive = FALSE)
## tokens from 1 document.
## Component 1 :
## [1] "The" "man" "is" "POS" "," "and" "not" "going" "to" "be" "NEG"
Thanks for that info! The WordStat approach allows you to specify a gap of N words between the negation term and the sentiment term. Gaps before and after are controlled separately. It's a very helpful approach. I'm sure N is limited to a fairly small number but I don't know it offhand.
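To make the gap idea concrete, here is a minimal base R sketch. The helper label_with_negation() and its window argument are hypothetical names, not part of quanteda or WordStat. It only looks for negations before a sentiment word and ignores sentence boundaries, so it is a simplification of the rule behaviour described above.

```r
# Hypothetical helper: label sentiment words as POS or NEG, flipping the
# polarity when a negation term occurs within `window` tokens before them.
label_with_negation <- function(tokens, positive, negative,
                                negations = c("not", "no", "never"),
                                window = 3) {
  lower <- tolower(tokens)
  out <- tokens
  for (i in seq_along(lower)) {
    if (lower[i] %in% positive) {
      polarity <- "POS"
    } else if (lower[i] %in% negative) {
      polarity <- "NEG"
    } else {
      next  # not a sentiment word, leave the token unchanged
    }
    # Tokens in the window immediately before position i
    start <- max(1, i - window)
    preceding <- if (i > 1) lower[start:(i - 1)] else character(0)
    if (any(preceding %in% negations)) {
      polarity <- if (polarity == "POS") "NEG" else "POS"
    }
    out[i] <- polarity
  }
  out
}

# Tokens of the second example sentence, written out by hand
toks2 <- c("The", "man", "is", "good", ",", "and",
           "not", "going", "to", "be", "bad")
labelled <- label_with_negation(toks2,
                                positive = c("good", "nice"),
                                negative = c("bad", "nasty"),
                                window = 4)
print(labelled)
# "The" "man" "is" "POS" "," "and" "not" "going" "to" "be" "POS"
```

The window has to be 4 here because four tokens separate "not" from "bad". A larger window catches more distant negations but also raises the risk of flipping an unrelated sentiment word.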
I've reopened this issue to include ways to implement the WordStat dictionary rules, described in https://www.provalisresearch.com/Documents/WordStat6.pdf, pp. 69-71.
Ideas:
- a new function along the lines of phrase(), such as rule()
- adding "rule" to the permissible values of valuetype, for dictionaries
I've been using the WordStat Sentiment dictionary:
https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/
but before I imported it, I removed the negation rules at the top with a text editor and it then works fine as a standard dictionary. WordStat's dictionary negation rules look like this:
NEGATIVE
NOT_GOOD
@NOTGOOD [#POSITIVE_WORDS AFTER #NEGATIONS /S 3] (1)
REAL_BAD
@REALBAD [#NEGATIVE_WORDS NOT AFTER #NEGATIONS /S 3] (1)
...
If I had left them there, would quanteda have understood them? If so, how would I apply them?
The tidytext package uses ngrams to address negation, as you may have seen here: http://tidytextmining.com/ngrams.html in section 5.1.3 "Using bigrams to provide context in sentiment". However, looking up to four words on either side using logic like that would be quite tedious compared to WordStat's dictionary rule approach.
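For comparison, the bigram logic can be sketched in a few lines of base R. This is not the tidytext code itself. The word list, scores, and negation terms below are made-up illustrations, and only the word immediately before each sentiment word is checked.

```r
# Lowercased words of an example sentence, punctuation removed by hand
words <- c("the", "man", "is", "not", "bad", "actually", "pretty", "good")

# Pair each word with the one that follows it
bigrams <- data.frame(word1 = head(words, -1),
                      word2 = tail(words, -1),
                      stringsAsFactors = FALSE)

# Illustrative sentiment scores and negation terms (assumptions)
scores <- c(good = 1, nice = 1, bad = -1, nasty = -1)
negations <- c("not", "no", "never")

# Score the second word of each bigram; words without a score count as 0
bigrams$score <- unname(scores[bigrams$word2])
bigrams$score[is.na(bigrams$score)] <- 0

# Reverse the score when the first word is a negation
negated <- bigrams$word1 %in% negations
bigrams$score[negated] <- -bigrams$score[negated]

print(bigrams)
print(sum(bigrams$score))
```

Covering a gap of several words this way would need trigrams, four-grams, and so on, each with its own lookup step, which is where the approach becomes tedious.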