Limit to subject entities

simongray commented 8 years ago

In order to improve the results of the sentiment analysis slightly (they are beyond horrible) I want to limit to only entities that are the subject of the sentence. This is how it should have been done from the beginning, anyway. This should preferably be included as a new custom annotator, e.g. SubjectAnnotator, to annotate the subject (SubjectAnnotation.class) in each S. I could then simply use the subject and named entities that overlap for sentiment.

I've been reading up on it on Wikipedia and it seems like I can just pick the first NP in the phrase structure parse to get the subject: https://en.wikipedia.org/wiki/Subject_(grammar)
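A rough sketch of the "first NP" idea, in plain Java so it runs without CoreNLP. It reads a Penn Treebank style bracketed parse and returns the first NP that is a direct child of the first S it finds. The class and method names (FirstNpSubject, subjectOf) are made up for illustration, and restricting the search to direct children of S is my own assumption, so that a fronted PP like "In the morning" doesn't win. In the real annotator the tree would come from the parser, not from a string.

```java
import java.util.ArrayList;
import java.util.List;

class FirstNpSubject {
    /** Minimal tree node for a bracketed constituency parse (hypothetical helper). */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        boolean isLeaf() { return children.isEmpty(); }
    }

    private final String[] tokens;
    private int pos = 0;

    private FirstNpSubject(String bracketed) {
        // Pad the brackets so that splitting on whitespace yields "(", ")", labels and words.
        this.tokens = bracketed.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    static Node parse(String bracketed) {
        return new FirstNpSubject(bracketed).readNode();
    }

    private Node readNode() {
        if (!tokens[pos].equals("(")) {
            return new Node(tokens[pos++]); // a bare word
        }
        pos++; // consume "("
        Node node = new Node(tokens[pos++]);
        while (!tokens[pos].equals(")")) {
            node.children.add(readNode());
        }
        pos++; // consume ")"
        return node;
    }

    /** Joins the words under a node with single spaces. */
    static String words(Node node) {
        if (node.isLeaf()) return node.label;
        StringBuilder sb = new StringBuilder();
        for (Node child : node.children) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(words(child));
        }
        return sb.toString();
    }

    /** Pre-order search: the first NP directly under the first S that has one, or "" if none. */
    static String subjectOf(Node node) {
        if (node.label.equals("S")) {
            for (Node child : node.children) {
                if (child.label.equals("NP") && !child.isLeaf()) return words(child);
            }
        }
        for (Node child : node.children) {
            String found = subjectOf(child);
            if (!found.isEmpty()) return found;
        }
        return "";
    }

    public static void main(String[] args) {
        Node tree = parse("(ROOT (S (NP (DT The) (NN dog)) (VP (VBD chased) (NP (DT the) (NN cat)))))");
        System.out.println(subjectOf(tree));
    }
}
```

This will still get it wrong for imperatives, questions and anything the parser itself mangles, which is part of why the dependency route looks more attractive.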

This post on stackoverflow seems to confirm that idea and also posits using the dependency parse to do it: http://stackoverflow.com/questions/18968457/is-there-a-reasonably-accurate-heuristic-for-detecting-the-subject-and-object-of

It might be a good idea to try out both if they're relatively easy to implement, and then enable either one through a config option in properties, e.g. subjectobject.parser = "depparse" or subjectobject.parser = "parse".
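A minimal sketch of how that switch could be read, using only java.util.Properties. The key subjectobject.parser is the one suggested above; the class name, the Mode enum and the choice of "depparse" as the default are assumptions for illustration. Note that in an actual .properties file the value would be written without quotes (subjectobject.parser = depparse), since quotes would become part of the value.

```java
import java.util.Locale;
import java.util.Properties;

class SubjectParserConfig {
    /** The two candidate strategies for finding the subject. */
    enum Mode { DEPPARSE, PARSE }

    static final String KEY = "subjectobject.parser";

    /** Reads the strategy from properties; falls back to DEPPARSE (assumed default). */
    static Mode fromProperties(Properties props) {
        String value = props.getProperty(KEY, "depparse").trim().toLowerCase(Locale.ROOT);
        switch (value) {
            case "depparse":
                return Mode.DEPPARSE;
            case "parse":
                return Mode.PARSE;
            default:
                throw new IllegalArgumentException(
                        "Unknown value for " + KEY + ": " + value + " (expected depparse or parse)");
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(KEY, "parse");
        System.out.println(fromProperties(props));
    }
}
```

Failing loudly on an unknown value seems safer than silently falling back, since a typo in the config would otherwise change which results end up in the report.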

Gonna start on the dependency parser version, seems like it makes the most sense. I can use this as a reference and also for justifying choices in the report: http://nlp.stanford.edu/software/dependencies_manual.pdf
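Rough idea of the dependency version, sketched without CoreNLP so it can be run on its own. It assumes the parse has already been flattened into (relation, governor index, dependent index) triples; the Dep class and subjectIndex method are hypothetical stand-ins, not CoreNLP types. The relation names nsubj and nsubjpass are the subject relations from the Stanford dependencies manual. The heuristic: find the root word, then take the dependent attached to it by a subject relation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class DependencySubject {
    /** One typed dependency, with 1-based token indices (hypothetical stand-in type). */
    static final class Dep {
        final String relation;
        final int governor;
        final int dependent;
        Dep(String relation, int governor, int dependent) {
            this.relation = relation;
            this.governor = governor;
            this.dependent = dependent;
        }
    }

    /** Returns the token index of the main clause subject, if there is one. */
    static Optional<Integer> subjectIndex(List<Dep> deps) {
        // The root relation points from the artificial ROOT (index 0) to the head word.
        int root = -1;
        for (Dep d : deps) {
            if (d.relation.equals("root")) root = d.dependent;
        }
        // Only accept subjects attached to the root, so subjects of embedded clauses are skipped.
        for (Dep d : deps) {
            boolean isSubject = d.relation.equals("nsubj") || d.relation.equals("nsubjpass");
            if (isSubject && d.governor == root) return Optional.of(d.dependent);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "The dog chased the cat": tokens are The-1 dog-2 chased-3 the-4 cat-5.
        List<Dep> deps = Arrays.asList(
                new Dep("root", 0, 3),
                new Dep("det", 2, 1),
                new Dep("nsubj", 3, 2),
                new Dep("det", 5, 4),
                new Dep("dobj", 3, 5));
        System.out.println(subjectIndex(deps));
    }
}
```

This returns only the head token of the subject; to overlap it with named entities, the full subject span (the head plus its dependents) would still have to be collected.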

Ref for depparse/NNDEP: http://nlp.stanford.edu/software/nndep.shtml

..

In addition to this, I'll also have to implement an alternative sentiment analysis solution as a backup, perhaps as a separate annotator using SentiStrength.

simongray commented 8 years ago

Another resource: http://nishutayaltech.blogspot.dk/2015/08/dependency-parsing-in-stanford-corenlp.html

simongray commented 8 years ago

For sentences where "I" or "my ..." is the subject, the object suddenly becomes more relevant.

For sentences with "he/she/it thinks/likes/hates etc." then the sentiment becomes irrelevant, since the opinion holder is not the reddit user, but someone else. I need to find a way to annotate those opinion words (probably using TokensRegex).
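The two rules above could look something like this. It is a sketch with plain string matching instead of TokensRegex, and the word lists only contain the examples mentioned here, so they are placeholders and not a real lexicon. The names OpinionHolderFilter, Target and sentimentTarget are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class OpinionHolderFilter {
    /** Which part of the sentence the sentiment should be attached to. */
    enum Target { SUBJECT, OBJECT, NONE }

    // Placeholder word lists taken from the examples in this comment.
    private static final Set<String> THIRD_PERSON = new HashSet<>(Arrays.asList("he", "she", "it"));
    private static final Set<String> OPINION_VERBS = new HashSet<>(Arrays.asList("thinks", "likes", "hates"));

    static Target sentimentTarget(String subject, String verb) {
        String s = subject.trim().toLowerCase(Locale.ROOT);
        String v = verb.trim().toLowerCase(Locale.ROOT);
        // "I" or "my ..." as subject: the user is the opinion holder, so look at the object.
        if (s.equals("i") || s.equals("my") || s.startsWith("my ")) return Target.OBJECT;
        // "he/she/it thinks/likes/hates": the opinion belongs to someone else, so skip it.
        if (THIRD_PERSON.contains(s) && OPINION_VERBS.contains(v)) return Target.NONE;
        // Otherwise keep the default of scoring the subject.
        return Target.SUBJECT;
    }

    public static void main(String[] args) {
        System.out.println(sentimentTarget("I", "love"));
        System.out.println(sentimentTarget("She", "hates"));
        System.out.println(sentimentTarget("The phone", "is"));
    }
}
```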

simongray commented 8 years ago

Perhaps the Sentic Patterns paper should be revisited, they also use parsing to determine stuff.

simongray commented 8 years ago

More: https://www.quora.com/How-can-I-find-the-subject-and-the-object-of-a-sentence-using-NLTK-in-python

http://stackoverflow.com/questions/5556778/is-there-a-way-to-get-the-subject-of-a-sentence-using-opennlp

http://stackoverflow.com/questions/19417381/collpased-dependencies-with-stanford-core-nlp

simongray commented 8 years ago

Continued in #23

simongray / StatementAnnotator

Limit to subject entities #21