umcu / clinlp

A Python library for performing NLP on clinical text written in Dutch
GNU General Public License v3.0

Fix issues with overlapping entities #36

Status: Closed (vmenger closed this issue 1 month ago)

vmenger commented 10 months ago

The EntityMatcher component writes the entities it matches to doc.ents using doc.set_ents, which does not allow overlapping entities and may therefore raise an error when two matches overlap. It would be nice to have more options here, such as removing the overlap (how?) or warning the user about it.

Minimal example:

import spacy
import clinlp  # not used directly; needed so spaCy knows the 'clinlp' language and components

if __name__ == '__main__':
    concepts = {
        'test': [
            [
                {"NORM": {"IN": ["zag", "ziet", "hoort", "hoorde", "ruikt", "rook"]}},
                {"OP": "?"},
                {"OP": "?"},
                {"OP": "?"},
                {"NORM": {"FUZZY1": "dingen"}},
                {"OP": "?"},
                {"NORM": "die"},
                {"NORM": "er"},
                {"OP": "?"},
                {"NORM": "niet"},
                {"OP": "?"},
                {"NORM": {"IN": ["zijn", "waren"]}}
            ],
        ]
    }

    nlp = spacy.blank('clinlp')
    entity_matcher = nlp.add_pipe("clinlp_entity_matcher")
    entity_matcher.load_concepts(concepts)

    # Presumably both 'hoort' and 'ziet' can start a match here, so the matches overlap
    nlp('hoort/ziet dingen die er niet zijn')

vmenger commented 9 months ago

Another option: use spans, which do allow for overlap.
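
A minimal sketch of this idea, assuming spaCy v3 is installed. The span group key "ents" and the hand-built token list are illustrative choices, not clinlp's API. The Doc is built from pre-split words so the example does not depend on any tokenizer.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hand-built tokens (assumption: "hoort/ziet" is split into three tokens).
words = ["hoort", "/", "ziet", "dingen", "die", "er", "niet", "zijn"]
doc = Doc(Vocab(), words=words)

# Two overlapping matches for the same concept.
long_match = Span(doc, 0, 8, label="test")
short_match = Span(doc, 2, 8, label="test")

# doc.set_ents rejects spans that share a token ...
try:
    doc.set_ents([long_match, short_match])
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

# ... while a span group stores both matches side by side.
doc.spans["ents"] = [long_match, short_match]
stored = [(span.start, span.end, span.label_) for span in doc.spans["ents"]]
```

Downstream components would then have to read from doc.spans instead of doc.ents, which is the main cost of this option.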

vmenger commented 9 months ago

Added a simple fix in #41 that removes overlap by keeping the longest span, but a more structural solution is still desired imo.
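
For illustration, the "keep the longest span" strategy could look roughly like the sketch below. This is not the code from #41: the function name remove_overlap and the (start, end, label) tuples (with end exclusive) are hypothetical. For real Span objects, spaCy's spacy.util.filter_spans applies the same preference for the longest span.

```python
def remove_overlap(matches):
    """Keep the longest matches, dropping any match that overlaps a kept one."""
    # Longest first; ties are broken by the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for start, end, label in ordered:
        tokens = range(start, end)
        if any(i in taken for i in tokens):
            continue  # overlaps a longer (or earlier) match, so drop it
        kept.append((start, end, label))
        taken.update(tokens)
    return sorted(kept)


# Hypothetical matches: the first two overlap, the third stands alone.
matches = [(0, 8, "test"), (2, 8, "test"), (9, 11, "other")]
result = remove_overlap(matches)
```

The shorter of the two overlapping matches is silently discarded, which is why this only works as a stopgap: information about the second match is lost.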