quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

Add WordStat dictionary rules (e.g. for negation) #516

Open · BobMuenchen opened this issue 7 years ago

BobMuenchen commented 7 years ago

I've been using the WordStat Sentiment dictionary:

https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/

but before importing it, I removed the negation rules at the top with a text editor, and it now works fine as a standard dictionary. WordStat's dictionary negation rules look like this:

NEGATIVE
  NOT_GOOD @NOTGOOD [#POSITIVE_WORDS AFTER #NEGATIONS /S 3] (1)
  REAL_BAD @REALBAD [#NEGATIVE_WORDS NOT AFTER #NEGATIONS /S 3] (1)
  ...

If I had left them there, would quanteda have understood them? If so, how would I apply them?

The tidytext package uses ngrams to address negation, as you may have seen here: http://tidytextmining.com/ngrams.html, in section 5.1.3, "Using bigrams to provide context in sentiment". However, checking up to four words on either side with that kind of logic would be quite tedious compared to WordStat's dictionary-rule approach.
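
For reference, a rough sketch of that bigram approach, adapted from the linked chapter (illustrative only: the single negation word and the AFINN column names are assumptions, and get_sentiments("afinn") may prompt a one-time download via the textdata package):

library(dplyr)
library(tidyr)
library(tidytext)

texts <- tibble(text = "The man is not bad, actually pretty good")

texts %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%        # split into bigrams
  separate(bigram, c("word1", "word2"), sep = " ") %>%            # one column per word
  filter(word1 == "not") %>%                                      # keep negated bigrams only
  inner_join(get_sentiments("afinn"), by = c("word2" = "word"))   # score the second word
# the sentiment value of these rows would then be sign-flipped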

koheiw commented 7 years ago

There is no function designed specifically for negations, but tokens_ngrams and tokens_compound can be used. While tokens_ngrams generates ngrams from all words, tokens_compound joins only the specified words. For example, tokens_compound(toks, phrase('not *')) will join each negation with the word that follows it, generating tokens like 'not_good' or 'not_nice'.
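
For instance, a minimal sketch (the example sentence is made up):

library(quanteda)
toks <- tokens("this movie is not good and not nice at all")
# join each "not" with the token that immediately follows it
tokens_compound(toks, phrase("not *"))
# yields tokens such as "not_good" and "not_nice"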

If you want a more generalized treatment of negations, use tokens_lookup with a dictionary that includes the negated forms:

library(quanteda)
positive <- c('good', 'nice')
negative <- c('bad', 'nasty')
# place the negated forms ('not bad', 'not good', ...) in the opposite category
dict <- dictionary(list(pos = c(positive, paste('not', negative)),
                        neg = c(negative, paste('not', positive))))
toks <- tokens("The man is not bad, actually pretty good")
tokens_lookup(toks, dict, exclusive = FALSE)
## tokens from 1 document.
## Component 1 :
## [1] "The"      "man"      "is"       "POS"      ","        "actually" "pretty"   "POS"

However, this method does not work when there are other words between the negation and the sentiment word:

toks2 <- tokens("The man is good, and not going to be bad")
tokens_lookup(toks2, dict, exclusive = FALSE)
## tokens from 1 document.
## Component 1 :
## [1] "The"   "man"   "is"    "POS"   ","     "and"   "not"   "going" "to"    "be"    "NEG"
BobMuenchen commented 6 years ago

Thanks for that info! The WordStat approach allows you to specify a gap of up to N words between the negation term and the sentiment term, with the gaps before and after controlled separately. It's a very helpful approach. I'm sure N is limited to a fairly small number, but I don't know it offhand.
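
A note on that: the window argument in the sketch above takes a pair c(before, after), so the gaps before and after could be approximated separately, again assuming a quanteda version that provides this argument:

# up to 1 token before and 3 tokens after each negation,
# roughly mirroring separate before/after gaps in a WordStat rule
tokens_compound(toks2, negations, window = c(1, 3))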

kbenoit commented 6 years ago

I've reopened this issue to cover ways of implementing the WordStat dictionary rules, described in https://www.provalisresearch.com/Documents/WordStat6.pdf, pp. 69-71.

kbenoit commented 6 years ago

Ideas: