scossin / iamsystem_python

Fast dictionary-based approach for semantic annotation / entity linking
MIT License
7 stars 1 forks source link

Performance issue because annotations overlap with a large window and in long documents #11

Closed scossin closed 1 year ago

scossin commented 1 year ago

During performance tests, I noticed that many overlapping annotations can be created when w is large; and since rm_nested_annot function performs a nested loop on each set of nested annotations, deleting nested annotations can take O(x2) with x the number of a nested annotations set.

Although there may be a faster algorithm for removing nested annotations, it is not appropriate to annotate every possible transition. For example:

matcher = Matcher.build(keywords=["cancer de la prostate"],
                        w=3)
annots = matcher.annot_text(text="cancer cancer de de la la prostate prostate")
for annot in annots:
    print(annot)
cancer de la prostate   0 6;14 16;23 34 cancer de la prostate
cancer de la prostate   0 6;17 19;23 34 cancer de la prostate
cancer de la prostate   0 6;14 16;20 22;26 34   cancer de la prostate
cancer de la prostate   0 6;17 22;26 34 cancer de la prostate
cancer de la prostate   0 6;14 16;23 25;35 43   cancer de la prostate
cancer de la prostate   0 6;17 19;23 25;35 43   cancer de la prostate
cancer de la prostate   0 6;14 16;20 22;35 43   cancer de la prostate
cancer de la prostate   0 6;17 22;35 43 cancer de la prostate
cancer de la prostate   7 16;23 34  cancer de la prostate
cancer de la prostate   7 13;17 19;23 34    cancer de la prostate
cancer de la prostate   7 16;20 22;26 34    cancer de la prostate
cancer de la prostate   7 13;17 22;26 34    cancer de la prostate
cancer de la prostate   7 16;23 25;35 43    cancer de la prostate
cancer de la prostate   7 13;17 19;23 25;35 43  cancer de la prostate
cancer de la prostate   7 16;20 22;35 43    cancer de la prostate
cancer de la prostate   7 13;17 22;35 43    cancer de la prostate

For each keyword, there are 2 paths and so 2⁴ paths/annotations are generated in this example. One solution to this problem that arises when setting up a large window on long documents is to store the states in a set rather than an array. By doing this, a path would replace an existing path, thus producing in this example a single annotation. The total number of states/paths that the algorithm could take will be limited by the size of the terminology and therefore the number of annotations cannot grow exponentially.