nleguillarme / taxonerd

TaxoNERD : recognizing taxonomic entities using deep models
MIT License

`EntityLinker` can't index concept ID's in database for `ncbi_taxonomy`, and default KBs don't generate candidates #20

Closed. serenalotreck closed this issue 4 months ago

serenalotreck commented 1 year ago

Context:

I'm using EntityLinker programmatically with the following code:

from taxonerd import TaxoNERD
from taxonerd.linking.linking import EntityLinker
from spacy.tokens import Span

ents = ['M. inflexa', 'bryophytes', 'homo sapiens', 'A. thaliana', 'Arabidopsis thaliana']

taxonerd = TaxoNERD()
nlp = taxonerd.load("en_core_eco_biobert")
doc = nlp(' '.join(ents))

span_idxs = []
for i, ent in enumerate(ents):
    if i == 0:
        start = 0
    else:
        start = len(' '.join(ents[:i]).split(' '))
    end = start + len(ent.split(' '))
    span_idxs.append((start, end))
spans = [Span(doc, e[0], e[1], "ENTITY") for e in span_idxs]
doc.set_ents(spans)

linker = EntityLinker('ncbi_taxonomy', resolve_abbreviations=False)
updated_doc = linker(doc)

Observed Behavior:

This code fails with a KeyError for the NCBI IDs (format NCBI:XXXX) on line 134 of linking.py.

The docs for the EntityLinker class indicate that the default knowledge bases are selected with name='umls' or name='mesh'. When I pass either of these on instantiation instead of using linker_name='ncbi_taxonomy', no CandidateGenerator is instantiated, so no candidate matches are generated. Passing them as linker_name instead causes an error from the KnowledgeBase class.

Expected Behavior:

  1. That the ID passed to cui_to_entity from the EntityLinker for NCBI Taxonomy is a valid key and doesn't throw an error
  2. That using the default KBs as instructed in the class docstring returns a valid CandidateGenerator

nleguillarme commented 12 months ago

The KeyError is caused by the linker trying to access the entity's definition field, which does not exist in the precompiled taxonomies. You have to add filter_for_definitions=False to the linker's config, alongside the resolve_abbreviations=False that you already pass.
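
As a rough illustration of this failure mode (a toy model, not TaxoNERD's or scispacy's actual code), suppose definitions live in a lookup keyed by concept ID, and the precompiled taxonomy ships with none. The `definitions` dictionary and the `keep_candidate` function are hypothetical names introduced only for this sketch; `filter_for_definitions` is the config key used in this thread.

```python
# Toy model only: `definitions` and `keep_candidate` are hypothetical,
# not part of TaxoNERD or scispacy.

def keep_candidate(concept_id, definitions, filter_for_definitions):
    """Decide whether a linking candidate is kept."""
    if filter_for_definitions:
        # Reads the definition for this ID; raises KeyError if the ID is absent.
        return definitions[concept_id] is not None
    # Filter disabled: the definition is never looked up.
    return True


definitions = {}  # assumed: a precompiled taxonomy without definitions

print(keep_candidate("NCBI:XXXX", definitions, filter_for_definitions=False))

try:
    keep_candidate("NCBI:XXXX", definitions, filter_for_definitions=True)
except KeyError as err:
    print("KeyError:", err)
```

In this model, disabling the filter skips the lookup entirely, which is why turning it off avoids the error.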

Here is a minimal working example:

from taxonerd import TaxoNERD
from taxonerd.linking.linking import EntityLinker
from spacy.tokens import Span

ents = [
    "M. inflexa",
    "bryophytes",
    "homo sapiens",
    "A. thaliana",
    "Arabidopsis thaliana",
]

taxonerd = TaxoNERD()
nlp = taxonerd.load("en_core_eco_biobert")
doc = nlp(" ".join(ents))

span_idxs = []
for i, ent in enumerate(ents):
    if i == 0:
        start = 0
    else:
        start = len(" ".join(ents[:i]).split(" "))
    end = start + len(ent.split(" "))
    span_idxs.append((start, end))
spans = [Span(doc, e[0], e[1], "ENTITY") for e in span_idxs]
doc.set_ents(spans)

config = {
    "linker_name": "ncbi_taxonomy",
    "resolve_abbreviations": False,
    "filter_for_definitions": False,
}

linker = EntityLinker(**config)
updated_doc = linker(doc)
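
The span arithmetic above can be written without re-joining the list on every iteration. This is a minimal sketch with a hypothetical helper name, `word_spans`; like the loop above, it assumes that the spaCy tokenizer produces exactly one token per whitespace-separated word, which may not hold for other inputs (for example, names containing punctuation that the tokenizer splits off).

```python
def word_spans(ents):
    """Return (start, end) word offsets for each entity in " ".join(ents).

    Offsets count whitespace-separated words, not spaCy tokens, so they are
    only valid token indices if the tokenizer splits on whitespace alone.
    """
    spans = []
    start = 0
    for ent in ents:
        end = start + len(ent.split(" "))
        spans.append((start, end))
        start = end  # the next entity begins where this one ends
    return spans


ents = [
    "M. inflexa",
    "bryophytes",
    "homo sapiens",
    "A. thaliana",
    "Arabidopsis thaliana",
]
print(word_spans(ents))
```

Comparing `len(doc)` with the last end offset before building the Span objects is a cheap way to detect a tokenization mismatch.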

References to umls or mesh should be removed from the docstring. These are artefacts from scispacy, from which the linker code was copied.