Removing duplicates from locs (location range of CUI annotations)

kaushikacharya commented 5 years ago

https://github.com/ncbi-nlp/NegBio/blob/master/negbio/pipeline/negdetect.py#L73-L76

        locs = []
        for ann in passage.annotations:
            total_loc = ann.get_total_location()
            locs.append((total_loc.offset, total_loc.offset + total_loc.length))

Here the location ranges of the CUI annotations are collected, and some of them can be duplicates. This happens because MetaMap can create multiple CUIs for the same text span.

detect() in neg_detector.py then loops over locs: https://github.com/ncbi-nlp/NegBio/blob/master/negbio/neg/neg_detector.py#L44

for loc in locs:

Wouldn't it be better to remove the duplicates from locs in negdetect.py, for example with:

locs = list(set(locs))
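One caveat with the one-liner above is that set() does not preserve the original order of the spans. Whether detect() depends on that order is not something I have checked, so the following is only a minimal sketch, using made-up sample spans, of an order-preserving alternative based on dict.fromkeys():

```python
# Hypothetical sample data: (start, end) spans like those collected in negdetect.py.
locs = [(25, 33), (12, 24), (25, 33)]

# set() removes duplicates, but the resulting order is not guaranteed.
unique_unordered = list(set(locs))

# dict.fromkeys() removes duplicates and keeps first-seen order (Python 3.7+).
unique_ordered = list(dict.fromkeys(locs))

print(unique_ordered)
```

Both variants work here because the spans are plain tuples and therefore hashable.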

An example of duplicate loc elements:

For the sentence:

There is no spinal canal hematoma.

MetaMap generates the following two CUIs for the same location span:

  <annotation id="2">
    <infon key="term">Hematoma</infon>
    <infon key="semtype">patf</infon>
    <infon key="CUI">C0018944</infon>
    <infon key="annotator">MetaMap</infon>
    <location length="8" offset="25"/>
    <text>hematoma</text>
  </annotation>

  <annotation id="4">
    <infon key="term">Hematoma Adverse Event</infon>
    <infon key="semtype">fndg</infon>
    <infon key="CUI">C1962958</infon>
    <infon key="annotator">MetaMap</infon>
    <location length="8" offset="25"/>
    <text>hematoma</text>
  </annotation>
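To make the duplication concrete, here is a small sketch that parses the two annotations above with the standard library and builds the same (start, end) tuples as the loop in negdetect.py. The enclosing <passage> element is an assumption added only so that the snippet is well-formed XML; the real pipeline works on bioc objects rather than raw XML.

```python
import xml.etree.ElementTree as ET

# The two annotations from above, wrapped in a <passage> element (an assumption
# made only so the snippet parses as a single XML document).
xml_snippet = """
<passage>
  <annotation id="2">
    <infon key="CUI">C0018944</infon>
    <location length="8" offset="25"/>
    <text>hematoma</text>
  </annotation>
  <annotation id="4">
    <infon key="CUI">C1962958</infon>
    <location length="8" offset="25"/>
    <text>hematoma</text>
  </annotation>
</passage>
"""

sentence = "There is no spinal canal hematoma."

locs = []
for ann in ET.fromstring(xml_snippet).iter("annotation"):
    location = ann.find("location")
    offset = int(location.get("offset"))
    length = int(location.get("length"))
    # Same (start, end) convention as in negdetect.py.
    locs.append((offset, offset + length))

print(locs)                      # two identical spans
print(sentence[25:33])           # the text both spans cover
```

Both annotations map to the span (25, 33), so the same span would be visited twice by the loop over locs.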

kaushikacharya commented 5 years ago

@yfpeng I figured out a better solution.

https://github.com/ncbi-nlp/NegBio/blob/master/negbio/pipeline/negdetect.py#L83

for name, matcher, loc in detector.detect(sentence, locs):

Instead of passing the locs for the entire passage, we can pass the unique locs for the current sentence.

https://github.com/ncbi-nlp/NegBio/blob/master/negbio/neg/neg_detector.py#L47

for node in find_nodes(g, loc[0], loc[1]):

Currently, all the locs that don't belong to the current sentence are processed unnecessarily.

https://github.com/ncbi-nlp/NegBio/blob/master/negbio/pipeline/negdetect.py#L85

_mark_anns(passage.annotations, loc[0], loc[1], name)

In addition, annotations belonging to other sentences are also checked for overlap with the loc for which negation/uncertainty was found.

I have made the above changes in my local copy. On a CT report with around 60 sentences, the following step dropped from 25 seconds to 14 seconds:

https://github.com/ncbi-nlp/NegBio/blob/master/negbio/main_mm.py#L60

document = negdetect.detect(document, neg_detector)
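The per-sentence filtering described above could look roughly like the sketch below. The helper name unique_locs_in_sentence and the sample offsets are hypothetical and not part of NegBio; the sketch works on plain (start, end) tuples and assumes a span belongs to a sentence when it lies entirely within the sentence's character range.

```python
def unique_locs_in_sentence(locs, sent_start, sent_end):
    """Return the distinct (start, end) spans lying inside [sent_start, sent_end].

    Hypothetical helper for illustration; first-seen order is preserved.
    """
    inside = (
        loc for loc in locs
        if sent_start <= loc[0] and loc[1] <= sent_end
    )
    return list(dict.fromkeys(inside))


# Made-up passage-level spans: one duplicate and one span from a later sentence.
passage_locs = [(25, 33), (25, 33), (40, 48), (12, 24)]

# Character range of the current sentence within the passage (assumed values).
sentence_locs = unique_locs_in_sentence(passage_locs, 0, 34)
print(sentence_locs)
```

Only the filtered list would then be passed to the detector for that sentence, and the same range check could be used to limit which annotations are compared in the marking step.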

yfpeng commented 3 years ago

Thank you!

Removing duplicates from locs (location range of CUI annotations) #31