sorgerlab / indra

INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system interfacing with NLP systems and databases to collect knowledge, and through a process of assembly, produce causal graphs and dynamical models.
http://indra.bio
BSD 2-Clause "Simplified" License

INDRA for NER #1294

Closed (joelthe1 closed 3 years ago)

joelthe1 commented 3 years ago

Question

I am interested in using INDRA for Named Entity Recognition (NER), but it is not clear to me from the documentation how I could do that. If I understood things correctly, I would need to use one or more 'Reading Systems' (as listed in the documentation) to create INDRA 'statements' and then get the named entities from these. If this is correct, could you say more about how I could extract the named entities from 'statements'? Also, I assume the ontology for the named entities would differ depending on the selected reading system. Any help is appreciated.

bgyori commented 3 years ago

Hi @joelthe1, thanks for your interest! You can certainly get named entities from text via INDRA, and I describe some specifics below, but there is an important issue to consider first: if your goal is to find all named entities in text but not the relationships between them, INDRA will be a lossy and indirect way to get them. This is because INDRA is specifically aimed at collecting and representing relationships between entities (e.g., A activates B). So if an entity appears in text but is not part of a relationship, it will not be picked up by INDRA.

Having said that, you can certainly use one of the reading systems integrated with INDRA to read some text, get some INDRA Statements, and then iterate over the entities in those Statements to get a list of entities. You can follow these instructions to set up the Reach reader, and then do the following:

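from indra.sources import reach

# Read the text with Reach at the local URL that INDRA defines
rp = reach.process_text("... some text ...", url=reach.local_text_url)
# Collect the agents that are actually present in each extracted Statement
all_agents = []
for stmt in rp.statements:
    all_agents += stmt.real_agent_list()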

Each agent has a name and a db_refs attribute representing its identity (see more info here). The ontology used to ground agents can indeed differ by source, and reading systems also often make grounding errors. INDRA's grounding mapper can correct some of these errors, perform model-based disambiguation, and use mappings to standardize groundings across different ontologies. You can run it on a list of statements as:

from indra.tools import assemble_corpus as ac
# Run the grounding mapper on the Statements produced by the reader
stmts = ac.map_grounding(rp.statements)
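
Once the agents are collected, a deduplicated entity list can be built from each agent's name and db_refs. The following is a minimal sketch rather than an INDRA function: `unique_entities` is a hypothetical helper, the demo agents are stand-in objects with illustrative values (so the snippet runs without a reader installed), and it assumes db_refs values are plain strings and that the raw mention is stored under the 'TEXT' key.

```python
from collections import OrderedDict
from types import SimpleNamespace


def unique_entities(agents):
    """Return one (name, db_refs) pair per distinct grounded entity.

    Hypothetical helper: works on any objects exposing `name` and
    `db_refs`, which is the interface INDRA Agents provide.
    """
    seen = OrderedDict()
    for agent in agents:
        # Drop the raw 'TEXT' entry so that different surface forms of the
        # same grounded entity collapse into a single key.
        refs = {ns: id_ for ns, id_ in agent.db_refs.items() if ns != 'TEXT'}
        key = (agent.name, tuple(sorted(refs.items())))
        # Keep the first occurrence and preserve encounter order.
        seen.setdefault(key, (agent.name, refs))
    return list(seen.values())


# Stand-in agents with illustrative values, used in place of real reader output
demo_agents = [
    SimpleNamespace(name='MAP2K1', db_refs={'TEXT': 'MEK1', 'HGNC': '6840'}),
    SimpleNamespace(name='MAPK1', db_refs={'TEXT': 'ERK2', 'HGNC': '6871'}),
    SimpleNamespace(name='MAP2K1', db_refs={'TEXT': 'MEK', 'HGNC': '6840'}),
]

for name, refs in unique_entities(demo_agents):
    print(name, refs)
```

With real output, the same helper could be applied to the agents gathered from the statements returned by the grounding mapper, so that deduplication happens on the standardized groundings instead of the raw reader output.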

Having said all this, again, because this only allows you to look at entities that are parts of relations, you should consider either using one of the reading systems directly to look at its raw NER output, or using a dedicated NER tool (one that isn't concerned with relation extraction) to get a more complete result.

joelthe1 commented 3 years ago

Thank you for the prompt and informative response. This answered my question well, so I am closing this issue. Let me know if there is another place I should ask questions like these (e.g., a forum or GitHub Discussions).