simongray / StatementAnnotator

Custom annotator for Stanford CoreNLP that annotates sentences with the underlying statements contained within them.

Statement comparison #53

Closed simongray closed 7 years ago

simongray commented 8 years ago

Statements need to be comparable. It might be worth starting out by testing a simplified version that doesn't use semantic similarity and then adding some kind of configuration option to enable it (re-using most of the simplified version).

simongray commented 8 years ago

It seems that ADW is more likely to give a high score when comparing multiple words than when comparing single words. When comparing single words as plain surface forms (SURFACE), i.e. without the lemma and POS tags used by SURFACE_TAGGED, the algorithm often just returns a score of 0. When comparing single words that are SURFACE_TAGGED, the scores for pairs such as like/love and hate/dislike are not as strong as for sentences featuring these words.

Maybe simply comparing the entire statement is the best way forward, followed by some kind of additional quality assurance if the score is above some specific limit. Perhaps each component can then be compared (with or without semantic similarity) to gauge where the statements are most alike.

simongray commented 8 years ago

Here's another source of semantic similarity: http://ws4jdemo.appspot.com/ https://code.google.com/archive/p/ws4j/

simongray commented 8 years ago

ADW - so far - is quite slow, so if I need to use it I need to - at the very least - filter the results before using it. I thought about simply comparing lemmas on components and then checking how many are equal. This can work both as a kind of topic search and as a way to compare statements in varying detail.
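
A minimal sketch of that filter-then-score idea, in Python for brevity. Everything here is an assumption for illustration: the dict-based statement representation, the function names, and `slow_score`, which stands in for whatever expensive similarity call (e.g. ADW) ends up being used.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a statement maps component names to lemmas.
Statement = Dict[str, str]


def shared_lemma_count(a: Statement, b: Statement) -> int:
    """Count the components that appear in both statements with equal lemmas."""
    return sum(1 for name, lemma in a.items() if b.get(name) == lemma)


def prefilter_then_score(
    target: Statement,
    candidates: List[Statement],
    slow_score: Callable[[Statement, Statement], float],
    min_shared: int = 1,
) -> List[Tuple[Statement, float]]:
    """Run the expensive scorer only on candidates that pass the cheap lemma check."""
    survivors = [c for c in candidates if shared_lemma_count(target, c) >= min_shared]
    return [(c, slow_score(target, c)) for c in survivors]
```

Raising `min_shared` makes the cheap check stricter, which is one way to get the "varying detail" mentioned above while keeping the number of slow calls down.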

simongray commented 8 years ago

Findings so far

Simply using primary (lemmas) has proven a difficult way to discover matching statements.

One of the reasons is probably that the statements are not completely well-formed, but another is that the equality required is too strict. In the original conception, semantic similarity would be used instead of equality on the individual components - at least in the case of the verb - to find similar statements. However, semantic similarity calculation is very slow - at least when using ADW - and the scores are usually not very convincing when only comparing single words, i.e. the scores are really all over the place for similar words, even when they are tagged as verbs. Furthermore, verbs such as "like" seem to always produce 0.0 scores in any comparison, which also makes it seem buggy.

Perhaps I will have to implement some patterns manually, such as "I " or " is " for narrowing down clear opinions?

In any case I will have to figure out a better way of finding matching statements. Perhaps an approach that:

  1. ignores uninteresting statements entirely, e.g. statements with unclear topics using this/that/they/it etc. for Subject and DirectObject - this has already been implemented somewhat
  2. ignores statements of unknown form, e.g. V+IO+E or some other strange combination?
  3. can be loosened up to simply look for statements with the same general topic, not necessarily opinion, e.g. "I try to meditate every morning" matches "some day I want to learn how to meditate" where "meditate" is the topic of both statements.
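
The three points above could be sketched roughly as follows (Python, for brevity). The role names, the set of "known" forms and the stop-lemma list are assumptions made up for the example, not what the annotator actually uses.

```python
from typing import Dict, Iterable, Set

# Pronouns from point 1 that make a topic unclear.
VAGUE_WORDS = {"this", "that", "they", "it"}

# Hypothetical set of accepted statement forms (point 2); anything else is ignored.
KNOWN_FORMS = {
    frozenset({"subject", "verb"}),
    frozenset({"subject", "verb", "object"}),
}

# Hypothetical lemmas that never count as a topic (point 3).
STOP_LEMMAS = {"i", "to", "some", "every", "how"}


def is_interesting(components: Dict[str, str]) -> bool:
    """Reject statements of unknown form or with a vague subject/object."""
    if frozenset(components) not in KNOWN_FORMS:
        return False
    for role in ("subject", "object"):
        if components.get(role, "").lower() in VAGUE_WORDS:
            return False
    return True


def shared_topics(lemmas_a: Iterable[str], lemmas_b: Iterable[str]) -> Set[str]:
    """Return the non-stop lemmas that two statements have in common."""
    a = {lemma.lower() for lemma in lemmas_a}
    b = {lemma.lower() for lemma in lemmas_b}
    return (a & b) - STOP_LEMMAS


first = ["I", "try", "to", "meditate", "every", "morning"]
second = ["some", "day", "I", "want", "to", "learn", "how", "to", "meditate"]
print(shared_topics(first, second))  # {'meditate'}
```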

Other things to fix

Statement finding can always get better.

  1. I can check the POS tag and discard some false positive Verbs for that reason

simongray commented 8 years ago

Tip

Start out small! Start by saying statements are similar if they have the same e.g. subject, then build up from that.

simongray commented 8 years ago

A crude but useful way to do comparisons is simply to output the number of matched components and then use the highest counts as the most similar statements for the interface. A count minimum could be set, as could some other checks such as semantic similarity and checking negatives.
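
A rough sketch of that ranking, in Python for brevity. The `Statement` class, its `negated` flag and the parameter names are assumptions for the example; the semantic similarity check is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Statement:
    """Hypothetical statement: component name -> lemma, plus a negation flag."""
    components: Dict[str, str]
    negated: bool = False


def matched_components(a: Statement, b: Statement) -> int:
    """Number of components with the same name and the same lemma."""
    return sum(1 for name, lemma in a.components.items()
               if b.components.get(name) == lemma)


def most_similar(
    target: Statement,
    candidates: List[Statement],
    min_count: int = 1,
    same_polarity: bool = True,
    limit: int = 5,
) -> List[Tuple[int, Statement]]:
    """Rank candidates by match count, dropping weak or opposite-polarity ones."""
    scored = []
    for candidate in candidates:
        if same_polarity and candidate.negated != target.negated:
            continue  # e.g. "I like tea" vs "I don't like tea"
        count = matched_components(target, candidate)
        if count >= min_count:
            scored.append((count, candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

The sort is stable, so candidates with equal counts keep their input order.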

simongray commented 8 years ago

(to combat weirdly phrased statements in the UI, it might make a lot of sense to use the entire sentence and then paint in the statement with colour or bold)

simongray commented 8 years ago

Ok, it seems like I have nothing in common with my brother >_< at least not when comparing even single components directly... very distressing.

simongray commented 8 years ago

It might still be possible to use WordNet as a thesaurus to find synonyms for doing comparisons in a more flexible way.
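
To illustrate what synonym-based matching could look like, here is a small Python sketch. The hand-written `SYNONYMS` table is only a stand-in for a real WordNet lookup (it reuses the like/love and hate/dislike pairs from earlier in this thread), and the function names are made up for the example.

```python
from typing import Dict, Set

# Stand-in for a thesaurus lookup; a real version would query WordNet instead.
SYNONYMS: Dict[str, Set[str]] = {
    "like": {"love"},
    "hate": {"dislike"},
}


def expand(lemma: str) -> Set[str]:
    """Return the lemma together with everything listed as its synonym."""
    lemma = lemma.lower()
    group = {lemma} | SYNONYMS.get(lemma, set())
    for head, synonyms in SYNONYMS.items():
        if lemma in synonyms:
            group |= {head} | synonyms
    return group


def loosely_equal(a: str, b: str) -> bool:
    """True if the two lemmas are equal or listed as synonyms of each other."""
    return b.lower() in expand(a)
```

Swapping `loosely_equal` in for strict lemma equality on the verb component would relax the comparison without needing a similarity score.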

simongray commented 7 years ago

I think I should implement the Jaccard index for getting similar statements, perhaps with some corrective additions: https://en.wikipedia.org/wiki/Jaccard_index

Should be possible, given a low enough limit, to get some kind of similar statements out using this very basic method.
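
For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal Python sketch over sets of lemmas, without any of the corrective additions; treating statements as plain lemma collections and the default `limit` value are assumptions for the example.

```python
from typing import Iterable, List, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard index |A & B| / |A | B| of two collections of lemmas."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


def similar_statements(
    target: List[str],
    candidates: List[List[str]],
    limit: float = 0.2,
) -> List[Tuple[float, List[str]]]:
    """Keep candidates scoring at least `limit`, best first."""
    scored = [(jaccard(target, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= limit]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


print(jaccard(["i", "like", "tea"], ["i", "like", "coffee"]))  # 0.5
```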

simongray commented 7 years ago

I'm moving from direct Statement comparison to just using patterns to match profiles against statements based on certain parameters.