(subsumes issue #21)

Rather than focusing simply on sentiment and entities, I want to get into more detail on the opinions. The base of this new system is an annotator called SemanticAnnotator that can find subjects and objects in a sentence. The objects can be either concepts (verb or verb+object) or simply noun phrases.

The goal is to use these semantic annotations to extract more exact opinions. By default, the assumed opinion holder is "the I" (the writer) of the comment, but s/he might also be expressing something about "you" (e.g. the other user or a general you, meaning "one") or about a third party (some entity). Focusing on sentence subjects provides a great deal more entities to compare than simply using tagged named entities would.

These opinions can in turn carry a sentiment, attached using the chosen sentiment analysis system (Stanford's or perhaps SentiStrength).

Opinion matching

Direct opinion matching

Sentiment becomes less relevant if the relationships themselves can be matched. For example, when comparing opinions, if both Person X and Person Y "play football", then sentiment is not even needed. The same goes for statements like "Sanders is going to win", provided that both people have expressed the same statement at some point.
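Direct matching could look roughly like the sketch below. It assumes Java 16+ and a hypothetical `HeldStatement` record whose subject, verb and object are already extracted and lemmatized; none of these names exist in the project yet.

```java
import java.util.Locale;

/** Hypothetical statement: the holder plus an already-lemmatized subject, verb and object. */
record HeldStatement(String holder, String subject, String verb, String object) {

    /** Normalizes a slot for comparison: null-safe, trimmed, lowercased. */
    private static String norm(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT);
    }

    /** True when subject, verb and object are identical; holder and sentiment are ignored. */
    boolean directlyMatches(HeldStatement other) {
        return norm(subject).equals(norm(other.subject))
            && norm(verb).equals(norm(other.verb))
            && norm(object).equals(norm(other.object));
    }
}
```

With this, "I play football" from Person X and from Person Y match even though the holders differ, while "I play tennis" does not.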

Matching on sentiment

Sentiment is needed when comparing opinions on third parties, e.g. if Person X thinks that "Clinton sucks donkeyballs", it might be hard to find an identical statement by Person Y. However, if both have an overall negative opinion about Clinton, then this can be used as a match.
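A minimal sketch of matching on overall sentiment follows. It assumes each user's sentences about a target have already been scored on a hypothetical signed scale from -1.0 to 1.0 (this is not the native output scale of Stanford's system or SentiStrength), and the 0.1 neutrality band is an arbitrary assumption.

```java
import java.util.List;

/** Hypothetical polarity buckets for an overall opinion about one target. */
enum Polarity { NEGATIVE, NEUTRAL, POSITIVE }

final class SentimentMatcher {
    private SentimentMatcher() {}

    /** Averages signed sentence scores (assumed range -1.0..1.0) into one polarity. */
    static Polarity overall(List<Double> scores) {
        double mean = scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        if (mean > 0.1) return Polarity.POSITIVE;   // arbitrary threshold
        if (mean < -0.1) return Polarity.NEGATIVE;  // arbitrary threshold
        return Polarity.NEUTRAL;
    }

    /** Two users agree on a target when their overall polarity is equal and not neutral. */
    static boolean agree(List<Double> scoresX, List<Double> scoresY) {
        Polarity x = overall(scoresX);
        return x != Polarity.NEUTRAL && x == overall(scoresY);
    }
}
```

Treating two neutral users as "not agreeing" is a design choice: without any expressed sentiment there is nothing to match on.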

Fuzzy opinion matching

Another, more precise way to perform opinion matching is fuzzy opinion matching. Let's say both users have expressed an opinion relating to Bernie Sanders' chances of winning the US presidency in 2016. Person X has expressed "I think [Bernie Sanders] will definitely [win] the [election]" while Person Y has expressed that "[Sanders] looks like he could [win] [this]". A fuzzy opinion matching algorithm would look at the semantics of the sentences and conclude that

the subject is a fuzzy match
the verb is a full match
the object is a fuzzy match (as "this" is non-specific, but could - in principle - refer to an election)

= we have a fuzzy match!

The same could occur if instead of "this" Person Y had written "the [presidency]". In that case, a database of word vectors or some other concept similarity measure could be used for the fuzzy matching and the word vector distance (or whatever quantitative measure is used) could provide a heuristic for the fuzzy matching.
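The slot-by-slot comparison with a word-vector heuristic might be sketched as below. The vectors, the 0.7 threshold and all class names are assumptions for illustration; the sketch only covers the word-vector case ("election" vs "presidency") and does not resolve pronouns like "this" or partial names like "Sanders" vs "Bernie Sanders".

```java
import java.util.Locale;
import java.util.Map;

/** Slot-level outcome, ordered from weakest to strongest. */
enum SlotMatch { NONE, FUZZY, FULL }

final class FuzzyMatcher {
    /** Hypothetical cutoff: cosine similarity at or above this counts as a fuzzy match. */
    static final double FUZZY_THRESHOLD = 0.7;

    private final Map<String, double[]> vectors; // lowercase word -> embedding

    FuzzyMatcher(Map<String, double[]> vectors) { this.vectors = vectors; }

    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Compares one slot (subject, verb or object) of two statements. */
    SlotMatch matchSlot(String x, String y) {
        if (x.equalsIgnoreCase(y)) return SlotMatch.FULL;
        double[] vx = vectors.get(x.toLowerCase(Locale.ROOT));
        double[] vy = vectors.get(y.toLowerCase(Locale.ROOT));
        if (vx == null || vy == null) return SlotMatch.NONE;
        return cosine(vx, vy) >= FUZZY_THRESHOLD ? SlotMatch.FUZZY : SlotMatch.NONE;
    }

    /** Overall result is the weakest slot; both arrays are assumed to hold the same slots in order. */
    SlotMatch match(String[] x, String[] y) {
        SlotMatch weakest = SlotMatch.FULL;
        for (int i = 0; i < x.length; i++) {
            SlotMatch m = matchSlot(x[i], y[i]);
            if (m.compareTo(weakest) < 0) weakest = m;
        }
        return weakest;
    }
}
```

Taking the weakest slot mirrors the reasoning above: a full verb match plus fuzzy subject and object yields a fuzzy match overall, and a single unmatched slot rules the pair out.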

Types of opinions or statements

Different types of opinions or statements are compared in different ways. For example

Self-referential statements where the "I" or "my ..." define the subject
Statements about specific entities, e.g. "Bernie Sanders" or "Donald Trump". NER tagging can be useful in figuring out whether that entity is a person, which means we could have yet another subclass of statement concerning persons (or locations or some other entity type).
...

They may all implement the same interface or abstract class, but a call to compare() will quickly ascertain that the subclasses differ and they are therefore not matching.

Creating these types of opinions/statements (I will figure out what to call them eventually) needs to be done using some sort of factory method that takes a SemanticGraph, possibly together with each nsubj relation, and returns an opinion/statement of the right type. The result conforms to the interface and is either A) a subclass or B) an instance with some kind of identifying field saying what type it is.
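Option A (one subclass per statement type) could be sketched as follows. To keep the example free of CoreNLP dependencies, the hypothetical factory takes an already-extracted subject, predicate and NER tag instead of a SemanticGraph; all type names are placeholders, and Java 16+ is assumed for records.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical common interface; the final name is still undecided. */
interface Statement {
    String subject();
    String predicate();

    /** Statements of different types never match; same-type ones compare subject and predicate. */
    default boolean compare(Statement other) {
        return getClass() == other.getClass()
            && subject().equalsIgnoreCase(other.subject())
            && predicate().equalsIgnoreCase(other.predicate());
    }
}

record SelfStatement(String subject, String predicate) implements Statement {}
record PersonStatement(String subject, String predicate) implements Statement {}
record EntityStatement(String subject, String predicate) implements Statement {}

final class StatementFactory {
    /** First words that mark a subject as self-referential ("I", "my ..."). */
    private static final Set<String> SELF = Set.of("i", "my");

    private StatementFactory() {}

    /** nerTag stands in for whatever the NER tagger reports for the subject, e.g. "PERSON". */
    static Statement create(String subject, String predicate, String nerTag) {
        String first = subject.trim().toLowerCase(Locale.ROOT).split("\\s+")[0];
        if (SELF.contains(first)) return new SelfStatement(subject, predicate);
        if ("PERSON".equals(nerTag)) return new PersonStatement(subject, predicate);
        return new EntityStatement(subject, predicate);
    }
}
```

The class check in `compare` is what lets a mismatch between, say, a self-referential statement and a statement about a person be rejected immediately, before any slot comparison happens.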

Fast opinion search

To make opinion comparisons fast enough for real-time use, it is necessary to use keyword search. Assuming we have prepared a mapping from keyword to relevant opinions in advance, it is simply a matter of comparing each keyword match (a comment, or perhaps a single sentence in it) to the relevant opinions.

The keywords can simply be the shortest form available of either the subject or the object. For most cases, it makes sense to just look for the subject ("Trump", "iOS", "Android", etc.). For self-referential opinions, however, the subject is the self and the interesting thing is the object that's being verbed, making the object the obvious keyword candidate.

With this list of simple keywords, comments are searched through one by one. When a match is found, the comment is run through the CoreNLP pipeline to extract statements/opinions for comparison with the relevant statements/opinions in the mapping.
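The keyword pre-filter could be sketched like this. `KeywordIndex` is a hypothetical name, opinions are represented as plain strings for brevity, and keywords are assumed to be single tokens; only comments that produce a non-empty result would be sent on to the (expensive) pipeline.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class KeywordIndex {
    /** Lowercased keyword -> opinions prepared in advance for that keyword. */
    private final Map<String, List<String>> opinionsByKeyword = new HashMap<>();

    void add(String keyword, String opinion) {
        opinionsByKeyword
            .computeIfAbsent(keyword.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
            .add(opinion);
    }

    /** Returns the known opinions for every keyword that occurs as a token in the comment. */
    Map<String, List<String>> candidates(String comment) {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        // Split on anything that is not a letter or digit, so punctuation does not block a match
        for (String token : comment.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            List<String> opinions = opinionsByKeyword.get(token);
            if (opinions != null) hits.put(token, opinions);
        }
        return hits;
    }
}
```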

Definitions of opinion vs sentiment

I want to be inspired by Bing Liu's definitions of opinion and sentiment and eventually produce Java classes that closely resemble them. Most likely, I will skip the aspect... aspect!

Definition 2.1 (Opinion):

An opinion is a quadruple,
(g,s,h,t),
where g is the sentiment target, s is the sentiment of the opinion about the target g, h is the opinion holder (the person or organization who holds the opinion), and t is the time when the opinion is expressed.

and

Definition 2.4 (Sentiment):

Sentiment is the underlying feeling, attitude, evaluation, or emotion associated with an opinion. It is represented as a triple,
(y,o,i),

where y is the type of the sentiment, o is the orientation of the sentiment, and i is the intensity of the sentiment.

and

Definition 2.7 (Opinion):

An opinion is a quintuple,
(e,a,s,h,t),
where e is the target entity, a is the target aspect of entity e on which the opinion has been given, s is the sentiment of the opinion on aspect a of entity e, h is the opinion holder, and t is the opinion posting time; s can be positive, negative, or neutral, or a rating (e.g., 1–5 stars). When an opinion is only on the entity as a whole, the special aspect GENERAL is used to denote it. Here e and a together represent the opinion target.
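As a rough idea of how Definitions 2.1 and 2.4 might translate to Java (16+ records), consider the sketch below. The field types are assumptions (plain strings for target, holder and sentiment type, an integer intensity with no fixed scale), and the aspect from Definition 2.7 is left out.

```java
import java.time.Instant;

/** Orientation o from Definition 2.4. */
enum Orientation { POSITIVE, NEGATIVE, NEUTRAL }

/** Definition 2.4: sentiment as a (type, orientation, intensity) triple (y, o, i). */
record Sentiment(String type, Orientation orientation, int intensity) {}

/** Definition 2.1: opinion as a (target, sentiment, holder, time) quadruple (g, s, h, t). */
record Opinion(String target, Sentiment sentiment, String holder, Instant time) {}
```

Records give value-based `equals` for free, which is convenient when checking whether two extracted opinions are literally identical.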

Implementation details

TBD

Meta information

Prior to annotating using the CoreNLP pipeline, relevant meta-info is included by setting custom RelevantNameForAnnotations on the Annotation object. The kind of information that is included:

timestamp
domain (e.g. reddit.com)
source (e.g. a link, perhaps also the subreddit)
score (most social networks have a scoring system of some kind)

In order to deal with more domain-specific information, such as score, the annotator needs a class to interpret it. This class can be supplied as a config option in the pipeline or perhaps even as part of the annotation in the shape of a reference to a Singleton meta-info interpreter. Domain-specific syntax such as that expressed in #12 can also be handled in this way, and preprocessing functionality such as stripping markdown can be defined as part of the pipeline, rather than as a separate step occurring before.
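One possible shape for such an interpreter is sketched below. Every name here is hypothetical, and the sketch assumes the raw score arrives as a plain integer string; a lookup by domain stands in for the config option or Singleton reference mentioned above.

```java
import java.util.Map;

/** Hypothetical raw meta-info carried alongside a comment. */
record MetaInfo(long timestamp, String domain, String source, String rawScore) {}

/** Interprets domain-specific fields; one implementation per domain. */
interface MetaInfoInterpreter {
    /** Maps the raw score to a number comparable within that domain. */
    double score(MetaInfo meta);
}

final class RedditInterpreter implements MetaInfoInterpreter {
    @Override
    public double score(MetaInfo meta) {
        // Assumes the raw score is a plain integer string such as "42" or "-3"
        return Integer.parseInt(meta.rawScore().trim());
    }
}

final class Interpreters {
    private static final Map<String, MetaInfoInterpreter> BY_DOMAIN =
        Map.of("reddit.com", new RedditInterpreter());

    private Interpreters() {}

    /** Looks up the interpreter for a domain, failing loudly for unknown domains. */
    static MetaInfoInterpreter forDomain(String domain) {
        MetaInfoInterpreter interpreter = BY_DOMAIN.get(domain);
        if (interpreter == null) {
            throw new IllegalArgumentException("No interpreter for " + domain);
        }
        return interpreter;
    }
}
```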

This meta information is useful both for backtracking during debugging and as a general feature in the UI (SemanticAnnotations are deep-linked into sentences of comments). It also serves as a ranking mechanism for doing the comparisons, e.g. timestamp can be used to rank opinions somewhat.
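Ranking by meta-info could be as simple as a comparator. The sketch below assumes, purely for illustration, that newer opinions rank first and that score breaks ties; `RankedOpinion` and `OpinionRanker` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical opinion with the meta-info needed for ranking. */
record RankedOpinion(String text, long timestamp, int score) {}

final class OpinionRanker {
    private OpinionRanker() {}

    /** Returns a new list sorted newest first; equal timestamps are ordered by higher score. */
    static List<RankedOpinion> rank(List<RankedOpinion> opinions) {
        List<RankedOpinion> sorted = new ArrayList<>(opinions);
        sorted.sort(Comparator.comparingLong(RankedOpinion::timestamp).reversed()
            .thenComparing(Comparator.comparingInt(RankedOpinion::score).reversed()));
        return sorted;
    }
}
```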

General architecture

PreprocessingAnnotator? - perhaps this conceptually doesn't make that much sense in the pipeline after all?
LanguageAnnotator? - annotates sentences with language using Apache Tika
SemanticAnnotator/OpinionAnnotator? - annotates sentences with opinions and possibly personal information too.

A SemanticAnnotator annotates social media (e.g. reddit) comments and the annotations found (let's call them opinions) are attached at the sentence level.

An opinion contains a subject and a predicate, including a verb and often some kind of object. The Opinion class contains a compare method that performs fuzzy matching (including direct and sentiment matching too) as described above.

... more TBD

Integration with personal details

TBD

simongray / StatementAnnotator

SemanticAnnotator, opinions and sentiment #23