umcu / clinlp

A Python library for performing NLP on clinical text written in Dutch
GNU General Public License v3.0

Error in entity matcher #65

Open loverma2 opened 4 months ago

loverma2 commented 4 months ago

Hi colleagues,

For research purposes I have loaded the entire Dutch UMLS, and I would like to use it to perform NER+L (named entity recognition and linking) with clinlp. I aim to extract and link all entities in over 90,000 clinical reports (anamneses) of heart failure patients, so that I can use them in further research.

This is an example of what my concept2cuis dictionary looks like.

{'C0000039': ['1 2 dipalmitoylphosphatidylcholine'],
 'C0000097': ['methyl phenyltetrahydropyridine',
  '1 methyl 4 phenyl 1 2 3 6 tetrahydropyridine',
  'mptp'],
 'C0000215': ['2 4 5 trichlorophenoxyacetic acid'],
 'C0000220': ['2 4 dichlorophenoxyacetic acid'],
 'C0000266': ['parlodel'],
 'C0000294': ['mesna',
  'mercapto ethane sulphonic acid',
  'sodium 2 mercapto ethane sulphonate'],
 'C0000378': ['dops',
  'droxidopa',
  'l dops',
  'l dihydroxyphenylserine',
  'l threo dihydroxyphenylserine'],
 'C0000379': ['3 4 methylenedioxyamphetamine',
  'mda',
  'methylenedioxyamphetamine',
  'tenamphetamine',
  'tenamphetamine'],
 'C0000392': ['beta alanine'],
 'C0000402': ['meglutol'],
 'C0000464': ['docosahexaenoate']}

However, I run into an error when running the nlp pipeline.

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<cell line: 7>() in <cell line: 7>
      5 )
      6 
----> 7 doc = nlp(text)

c:doc in __call__(self, text, disable, component_cfg)
   1045 raise ValueError(Errors.E109.format(name=name)) from e
   1046 except Exception as e:
-> 1047 error_handler(name, proc, [doc], e)
   1048 if not isinstance(doc, Doc):
   1049 raise ValueError(Errors.E005.format(name=name, returned_type=type(doc)))

c:c:e in raise_error(proc_name, proc, docs, e)
   1722 
   1723 def raise_error(proc_name, proc, docs, e):
-> 1724 raise e
   1725 
   1726 

c:__call__(self, text, disable, component_cfg) in __call__(self, text, disable, component_cfg)
   1040 error_handler = proc.get_error_handler()
   1041 try:
-> 1042 doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
   1043 except KeyError as e:
   1044 # This typically happens if a component is not initialized

c:ents.py in __call__(self, doc)
    167 ents.append(Span(doc=doc, start=start, end=end, label=self._concepts[match_id]))
    168 
--> 169 doc.set_ents(entities=ents)
    170 
    171 return doc

c:³ in spacy.tokens.doc.set_ents()

**ValueError: [E1010] Unable to set entity information for token 1 which is included in more than one span in entities, blocked, missing or outside.**

I think this has to do with the fact that there can be either duplicate CUIs (keys) or duplicate concepts (values) in the UMLS. I would like the pipeline to be able to deal with this, because it makes clinical sense to have duplicates. For example, returning a list of all matches, or picking the first possible match, would be great. Is this possible?
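To check whether the dictionary really contains such duplicates, something like the following plain-Python sketch could be used. It assumes the dictionary has the shape shown above (CUI keys, lists of term strings as values). The function names and the second CUI in the toy data are hypothetical and only serve as illustration; nothing here is part of clinlp.

```python
from collections import defaultdict


def find_ambiguous_terms(cui_to_terms):
    """Return {term: sorted list of CUIs} for terms listed under more than one CUI."""
    term_to_cuis = defaultdict(set)
    for cui, terms in cui_to_terms.items():
        for term in terms:
            term_to_cuis[term].add(cui)
    return {term: sorted(cuis) for term, cuis in term_to_cuis.items() if len(cuis) > 1}


def find_repeated_terms(cui_to_terms):
    """Return {cui: terms that occur more than once within that CUI's own list}."""
    repeated = {}
    for cui, terms in cui_to_terms.items():
        seen, dupes = set(), []
        for term in terms:
            if term in seen and term not in dupes:
                dupes.append(term)
            seen.add(term)
        if dupes:
            repeated[cui] = dupes
    return repeated


# Toy data: "C9999999" is a made-up CUI used only to create an ambiguous term.
sample = {
    "C0000379": ["mda", "tenamphetamine", "tenamphetamine"],
    "C9999999": ["mda"],
}

ambiguous = find_ambiguous_terms(sample)
repeated = find_repeated_terms(sample)
```

Terms reported by either function are candidates for a small reproducible example. Note that nested terms such as 'dops' and 'l dops' could also produce overlapping matches, even when no term is duplicated.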

Thank you very much in advance and if you need more information please let me know!

Malin

vmenger commented 4 months ago

Hi Malin,

Yes, I see the problem that arises when a term is associated with multiple concepts (CUIs) (see also #36). spaCy's doc.ents cannot hold overlapping spans, whereas span groups in doc.spans can. The solution would therefore be for clinlp to use spans instead of entities, and I think we should implement that change in a new release. Right now I'm unsure whether that would cause any other problems, so we need to investigate further.

In the meantime, can you check whether you are using the latest version of clinlp (0.6.4)? I'm not 100% sure what exactly causes this error, so if you happen to have a minimal example (concepts and a sample text), that would also help to resolve the issue.

Thanks, Vincent

vmenger commented 3 weeks ago

Hi @loverma2, I just released version 0.8.0 of clinlp, which should be able to handle overlapping concepts.

You will need to make some small changes to your code to keep it working. You can find all changes here: https://github.com/umcu/clinlp/blob/main/CHANGELOG.md

The entities can be found in doc.spans['ents'] (rather than the previous doc.ents). By default, overlapping entities are kept, but you can also configure the pipeline to resolve overlap (takes longest, assuming that is the most specific):

nlp.add_pipe('clinlp_rule_based_entity_matcher', config={'resolve_overlap': True})
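To illustrate what a "longest match wins" strategy means, here is a minimal plain-Python sketch. It is not clinlp's actual implementation: the function name, the (start, end, label) tuple representation with an exclusive end index, and the tie-breaking rule are all assumptions made for illustration.

```python
def resolve_overlap_longest(spans):
    """Keep the longest spans and drop any span overlapping one already kept.

    Each span is a (start, end, label) tuple of token indices, end exclusive.
    Ties in length are broken by the earliest start (an arbitrary choice here).
    """
    kept = []
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # Two spans are disjoint if one ends before (or where) the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Tokens: "l", "dops", "en", "parlodel". Both 'l dops' (tokens 0-1) and
# 'dops' (token 1) match, so token 1 would belong to two spans.
spans = [(0, 2, "C0000378"), (1, 2, "C0000378"), (3, 4, "C0000266")]
resolved = resolve_overlap_longest(spans)
```

In this toy input the shorter 'dops' span is dropped in favour of 'l dops', and the non-overlapping 'parlodel' span is left untouched.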

Let me know whether this solves the problem or whether some other issue remains, and whether you need any help migrating to the latest version. We can always schedule a call or a short meeting if that is helpful.

Best, Vincent

loverma2 commented 2 weeks ago

Awesome, thank you Vincent!