google-code-export / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl
0 stars 0 forks source link

Better phrase matching logic in DictionaryAnnotator #569

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The current DictionaryAnnotator implementation checks for overlapping phrases 
to be annotated, and disallows overlapping annotations produced if and only if 
the overlapping phrases start at the same token. I.e., for tokens sequence 
w1 w2 w3 w4 w5

and phrase tree containing
w2 w3 w4 + w2 w3
resulting markup is
"w2 w3 w4" ("w2 w3" is dropped).

However, the phrase tree containing
w2 w3 w4 + w3 w4
resulting markup is
"w2 w3 w4" + "w3 w4".

This is in my opinion, counter intuitive. The annotator would be better off 
allowing any kind of overlap, or disallowing any kind of overlap. The first 
option resembles more the intuitive phrase lookup usecase, while the latter is 
compatible with the "chunking style" usecase, such as a dictionary based Named 
Entity Annotator. For the no-overlap case, a longest match preferred, 
left-first in case of ties is a straightforward matching strategy.

What steps will reproduce the problem?
1. Sample text above, w1 ... w5
2. Phrase as above
3. Check markups produced

What is the expected output? What do you see instead?

I think the expected output should be either all overlaps marked up or none, 
controlled by a configuration parameter.

What version of the product are you using? On what operating system?

Current, non-OS-specific.

Please provide any additional information below.

This can be easily implemented by creating a small class for matched phrases, 
implementing Comparable (compares by token length and then begin offset -- 
a.k.a. longest prefered, left first), and storing all matches within the 
sentence in a TreeSet. Than while traversing the set, i) all matches can be 
added, or ii) the current match can be quickly checked against previous items 
in the TreeSet depending on the annotator's configuration parameter:

i:        final TreeSet<Match> allMatches = getMatches(...);
//NOTE, THIS IS VERY SIMILAR TO THE CURRENT ROUTINE, JUST ALSO ALLOWING 
MULTIPLE MATCHES WITH THE SAME START TOKEN

//AND THEN IN 6 LINES, HOW TO SELECT NON-OVERLAPPING MATCHES (FROM THE SORTED 
SET AS DESCRIBED ABOVE)
ii:        
for (final Match match : allMatches) {
    boolean overlaps = false;
    for (final Match matched : alreadySelected)
        if (!(matched.beginIndex > match.endIndex || matched.endIndex < match.beginIndex))
            overlaps = true;
}
if (!overlaps) alreadySelected.add(match);

Original issue reported on code.google.com by szarv...@amazon.de on 18 Dec 2014 at 3:51