During performance tests, I noticed that many overlapping annotations can be created when w is large; and since the rm_nested_annot function performs a nested loop over each set of nested annotations, deleting nested annotations can take O(x²), where x is the size of a set of nested annotations.
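To illustrate where the quadratic cost comes from, here is a minimal sketch of pairwise nested-annotation removal, assuming annotations are simply (start, end) spans. This is an illustration of the complexity, not the library's actual rm_nested_annot implementation.

```python
def rm_nested(annots):
    """Keep only annotations not contained in another annotation.

    annots: list of (start, end) spans. Every span is compared against
    every other span, hence O(x^2) for a set of x overlapping annotations.
    This is a simplified sketch, not the library's rm_nested_annot code.
    """
    keep = []
    for i, (s1, e1) in enumerate(annots):
        nested = any(
            s2 <= s1 and e1 <= e2 and (s1, e1) != (s2, e2)
            for j, (s2, e2) in enumerate(annots)
            if j != i
        )
        if not nested:
            keep.append((s1, e1))
    return keep

print(rm_nested([(0, 10), (0, 6), (3, 10), (2, 5)]))  # → [(0, 10)]
```

With 16 overlapping annotations, as in the example below, this already performs 16 × 15 = 240 span comparisons.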
Although a faster algorithm for removing nested annotations may exist, the deeper problem is that annotating every possible transition is not appropriate in the first place. For example:
matcher = Matcher.build(keywords=["cancer de la prostate"], w=3)
annots = matcher.annot_text(text="cancer cancer de de la la prostate prostate")
for annot in annots:
    print(annot)
cancer de la prostate 0 6;14 16;23 34 cancer de la prostate
cancer de la prostate 0 6;17 19;23 34 cancer de la prostate
cancer de la prostate 0 6;14 16;20 22;26 34 cancer de la prostate
cancer de la prostate 0 6;17 22;26 34 cancer de la prostate
cancer de la prostate 0 6;14 16;23 25;35 43 cancer de la prostate
cancer de la prostate 0 6;17 19;23 25;35 43 cancer de la prostate
cancer de la prostate 0 6;14 16;20 22;35 43 cancer de la prostate
cancer de la prostate 0 6;17 22;35 43 cancer de la prostate
cancer de la prostate 7 16;23 34 cancer de la prostate
cancer de la prostate 7 13;17 19;23 34 cancer de la prostate
cancer de la prostate 7 16;20 22;26 34 cancer de la prostate
cancer de la prostate 7 13;17 22;26 34 cancer de la prostate
cancer de la prostate 7 16;23 25;35 43 cancer de la prostate
cancer de la prostate 7 13;17 19;23 25;35 43 cancer de la prostate
cancer de la prostate 7 16;20 22;35 43 cancer de la prostate
cancer de la prostate 7 13;17 22;35 43 cancer de la prostate
Each of the keyword's 4 tokens can be matched at 2 positions, so 2⁴ = 16 paths/annotations are generated in this example. One solution to this problem, which arises when a large window is used on long documents, is to store the states in a set rather than an array. That way, a new path replaces an existing path in the same state, producing a single annotation in this example. The total number of states/paths the algorithm can occupy is then bounded by the size of the terminology, so the number of annotations cannot grow exponentially.
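The effect of the set-based fix can be sketched with a simplified matching model, where a state is the number of keyword tokens matched so far and a path is the sequence of text positions used. This is an assumption-laden toy model (it ignores the window w and the trie), not the library's actual code:

```python
keyword = "cancer de la prostate".split()
text = "cancer cancer de de la la prostate prostate".split()

def matches(dedup):
    """Simplified matching model, not the library's implementation.

    A state is (i, path): i = number of keyword tokens matched so far,
    path = the text positions used. State (0, ()) stays active so a
    match can start at any position.
    """
    states = [(0, ())]
    for pos, tok in enumerate(text):
        advanced = [(i + 1, path + (pos,))
                    for i, path in states
                    if i < len(keyword) and tok == keyword[i]]
        states = states + advanced
        if dedup:
            # Set-like storage: keep one entry per state, so a new
            # path replaces an existing path in the same state.
            states = list({i: (i, path) for i, path in states}.values())
    return [path for i, path in states if i == len(keyword)]

print(len(matches(dedup=False)))  # → 16 annotations (2**4 paths)
print(len(matches(dedup=True)))   # → 1 annotation
```

With an array, every partial path is kept and the count doubles at each duplicated token; with a set keyed by state, the number of live paths can never exceed the number of states, which depends only on the terminology.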