The current DictionaryAnnotator implementation checks for overlapping phrases
to be annotated, and disallows overlapping annotations produced if and only if
the overlapping phrases start at the same token. I.e., for tokens sequence
w1 w2 w3 w4 w5
and phrase tree containing
w2 w3 w4 + w2 w3
resulting markup is
"w2 w3 w4" ("w2 w3" is dropped).
However, the phrase tree containing
w2 w3 w4 + w3 w4
resulting markup is
"w2 w3 w4" + "w3 w4".
This is in my opinion, counter intuitive. The annotator would be better off
allowing any kind of overlap, or disallowing any kind of overlap. The first
option resembles more the intuitive phrase lookup usecase, while the latter is
compatible with the "chunking style" usecase, such as a dictionary based Named
Entity Annotator. For the no-overlap case, a longest match preferred,
left-first in case of ties is a straightforward matching strategy.
What steps will reproduce the problem?
1. Sample text above, w1 ... w5
2. Phrase as above
3. Check markups produced
What is the expected output? What do you see instead?
I think the expected output should be either all overlaps marked up or none,
controlled by a configuration parameter.
What version of the product are you using? On what operating system?
Current, non-OS-specific.
Please provide any additional information below.
This can be easily implemented by creating a small class for matched phrases,
implementing Comparable (compares by token length and then begin offset --
a.k.a. longest prefered, left first), and storing all matches within the
sentence in a TreeSet. Than while traversing the set, i) all matches can be
added, or ii) the current match can be quickly checked against previous items
in the TreeSet depending on the annotator's configuration parameter:
i: final TreeSet<Match> allMatches = getMatches(...);
//NOTE, THIS IS VERY SIMILAR TO THE CURRENT ROUTINE, JUST ALSO ALLOWING
MULTIPLE MATCHES WITH THE SAME START TOKEN
//AND THEN IN 6 LINES, HOW TO SELECT NON-OVERLAPPING MATCHES (FROM THE SORTED
SET AS DESCRIBED ABOVE)
ii:
for (final Match match : allMatches) {
boolean overlaps = false;
for (final Match matched : alreadySelected)
if (!(matched.beginIndex > match.endIndex || matched.endIndex < match.beginIndex))
overlaps = true;
}
if (!overlaps) alreadySelected.add(match);
Original issue reported on code.google.com by szarv...@amazon.de on 18 Dec 2014 at 3:51
Original issue reported on code.google.com by
szarv...@amazon.de
on 18 Dec 2014 at 3:51