simongray closed 7 years ago
It seems that ADW is more likely to give a high score when comparing multiple words than when comparing single words. When comparing single words as plain surface forms (SURFACE), without lemma & POS tags (SURFACE_TAGGED), the algorithm often just returns a score of 0. When comparing single words that are SURFACE_TAGGED, the scores between e.g. like/love or hate/dislike are not as strong as for sentences featuring these words.
Maybe simply comparing the entire statement is the best way forward, followed by some kind of additional quality assurance if the score is above some specific limit. Perhaps each component can then be compared (with or without semantic similarity) to gauge where the statements are most alike.
Here's another source of semantic similarity: http://ws4jdemo.appspot.com/ https://code.google.com/archive/p/ws4j/
ADW - so far - is quite slow, so if I need to use it I need to - at the very least - filter the candidates before passing them to it. I thought about simply comparing lemmas on components and then checking how many are equal. This can work both as a kind of topic search and as a way to compare statements in varying detail.
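A minimal sketch of that pre-filter, assuming each statement has already been reduced to a list of lemmas. The function names and the `min_shared` parameter are made up for illustration and are not part of ADW:

```python
def shared_lemmas(a, b):
    """Return the set of lemmas that two statements have in common."""
    return set(a) & set(b)


def prefilter(query, candidates, min_shared=1):
    """Keep only candidates sharing at least `min_shared` lemmas with the query.

    Only the survivors would then be handed to the slow semantic scorer.
    """
    return [c for c in candidates if len(shared_lemmas(query, c)) >= min_shared]


# Hypothetical, hand-lemmatised statements.
query = ["i", "like", "cake"]
candidates = [
    ["she", "like", "cake"],
    ["he", "hate", "rain"],
    ["i", "bake", "bread"],
]

survivors = prefilter(query, candidates, min_shared=2)
```

With `min_shared=1` this behaves more like a loose topic search; raising it compares statements in more detail.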
Simply using primary (lemmas) has proven a difficult way to discover matching statements.
One of the reasons is probably that the statements are not completely well-formed, but another is that the equality needed is too strong. In the original conception, semantic similarity would be used instead of equality on the individual components - at least in the case of the verb - to find similar statements. However, semantic similarity calculation is very slow - at least when using ADW - and the scores are usually not very convincing when only comparing single words, i.e. the scores are all over the place for similar words, even when they are tagged as verbs. Furthermore, verbs such as "like" seem to always produce 0.0 scores in any comparison, which also makes it seem buggy.
Perhaps I will have to implement some patterns manually, such as "I
In any case I will have to figure out a better way of finding matching statements. Perhaps an approach that
Statement finding can always get better.
Start out small! Start by saying statements are similar if they have the same e.g. subject, then build up from that.
A crude but useful way to do comparisons is simply to output the number of matched components and then show the statements with the highest counts as the most similar ones in the interface. A minimum count could be set, as could some other checks such as semantic similarity and checking for negation.
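As a rough sketch of the counting idea, assuming a statement is a dict with `subject`, `verb` and `object` keys (that layout and all names below are assumptions for illustration, not the project's actual Statement class):

```python
COMPONENTS = ("subject", "verb", "object")  # assumed component names


def match_count(a, b):
    """Count the components that are present in `a` and identical in `b`."""
    return sum(
        1
        for key in COMPONENTS
        if a.get(key) is not None and a.get(key) == b.get(key)
    )


def most_similar(query, candidates, min_count=1):
    """Return (count, statement) pairs with count >= min_count, best first."""
    scored = [(match_count(query, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= min_count]
    # sorted() is stable, so ties keep their original order.
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


query = {"subject": "i", "verb": "like", "object": "cake"}
candidates = [
    {"subject": "i", "verb": "like", "object": "tea"},
    {"subject": "he", "verb": "hate", "object": "rain"},
    {"subject": "i", "verb": "like", "object": "cake"},
]

ranked = most_similar(query, candidates, min_count=1)
```

Extra checks (semantic similarity, negation) could be applied to the ranked list afterwards rather than inside the counting step.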
(to combat weirdly phrased statements in the UI, it might make a lot of sense to use the entire sentence and then paint in the statement with colour or bold)
Ok, it seems like I have nothing in common with my brother >_< at least not when comparing even single components directly... very distressing.
It might still be possible to use WordNet as a thesaurus to find synonyms for doing comparisons in a more flexible way.
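A small sketch of what synonym-based matching could look like. The `SYNONYMS` table here is a hand-written stand-in, not WordNet data; a real version would presumably fill it from WordNet synsets instead:

```python
# Hand-written stand-in for a thesaurus lookup (NOT real WordNet output).
SYNONYMS = {
    "like": {"love", "enjoy"},
    "hate": {"dislike"},
}


def expand(lemma):
    """Return the lemma together with everything listed as its synonym."""
    group = {lemma} | SYNONYMS.get(lemma, set())
    # Also look the lemma up in the other direction (love -> like).
    for head, synonyms in SYNONYMS.items():
        if lemma in synonyms:
            group |= {head} | synonyms
    return group


def loosely_equal(a, b):
    """True if the two lemmas are identical or share a synonym group."""
    return bool(expand(a) & expand(b))
```

`loosely_equal` could then replace strict equality when comparing individual components, e.g. for the verb only.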
I think I should implement the Jaccard index for getting similar statements, perhaps with some corrective additions: https://en.wikipedia.org/wiki/Jaccard_index
Should be possible, given a low enough limit, to get some kind of similar statements out using this very basic method.
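For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal sketch, assuming statements are lists of lemmas; the `limit` default and the function names are placeholders, and returning 0.0 for two empty statements is a choice made here so that empty statements never match:

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of lemmas."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # choice made for this sketch; see the note above
    return len(a & b) / len(a | b)


def similar_statements(query, candidates, limit=0.2):
    """Return (score, statement) pairs scoring at least `limit`, best first."""
    scored = [(jaccard(query, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= limit]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


query = ["i", "like", "cake"]
candidates = [
    ["he", "hate", "rain"],
    ["i", "bake", "bread"],
    ["she", "like", "cake"],
]

results = similar_statements(query, candidates, limit=0.2)
```

The "corrective additions" could be layered on top of the raw score, for example by weighting the verb more heavily than the other components.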
I'm moving from direct Statement comparison to just using patterns to match profiles against statements based on certain parameters.
Statements need to be comparable. It might be worth starting out by testing a simplified version that doesn't use semantic similarity and then adding some kind of configuration option to enable it (re-using most of the simplified version).
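One way this could be structured, sketched under the assumption that statements are dicts of components: the simplified comparison uses plain equality, and an optional similarity function is plugged in through a config object. `ComparisonConfig`, `toy_similarity` and the threshold value are all hypothetical; `toy_similarity` only stands in for a real scorer such as ADW:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ComparisonConfig:
    # None means "simplified mode": plain equality only.
    similarity: Optional[Callable[[str, str], float]] = None
    threshold: float = 0.5  # assumed cut-off, would need tuning


def components_match(a, b, config):
    """Equality first; fall back to semantic similarity only if enabled."""
    if a == b:
        return True
    if config.similarity is None:
        return False
    return config.similarity(a, b) >= config.threshold


def match_count(s1, s2, config):
    """Number of shared components that match under the given config."""
    return sum(
        1 for key in s1 if key in s2 and components_match(s1[key], s2[key], config)
    )


def toy_similarity(a, b):
    """Stand-in scorer for illustration; NOT a real semantic measure."""
    return 0.8 if {a, b} == {"like", "love"} else 0.0


s1 = {"subject": "i", "verb": "like", "object": "cake"}
s2 = {"subject": "i", "verb": "love", "object": "cake"}

plain = match_count(s1, s2, ComparisonConfig())
semantic = match_count(s1, s2, ComparisonConfig(similarity=toy_similarity))
```

Because the equality check runs first, the slow scorer is only called for components that differ, and the simplified version is reused unchanged.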