Question about section type in REACH processor

sorgerlab / indra

INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system interfacing with NLP systems and databases to collect knowledge, and through a process of assembly, produce causal graphs and dynamical models.

http://indra.bio

BSD 2-Clause "Simplified" License


Question about section type in REACH processor #1388

Closed: sanyabt closed this issue 1 year ago

sanyabt commented 2 years ago

Hi, I am trying to extract the section type from the INDRA statement evidence after processing with the REACH processor. However, the value is always null in the extracted statements. Is the section_type field assigned only in particular cases or do I need to set something for it? I can see the section_type in the code here but am not sure when it is assigned. I am using INDRA v1.19.0 and REACH v1.6.3. Thanks!

bgyori commented 2 years ago

This is a known issue in Reach and is going to be addressed very soon. Once section titles are correctly extracted by Reach I will comment here to let you know. Thanks!

bgyori commented 1 year ago

Hi @sanyabt, this was finally addressed in Reach a few days ago; see https://github.com/clulab/reach/pull/775. The new implementation requires some changes on the INDRA side as well: https://github.com/sorgerlab/indra/pull/1399. Once I merge those changes, we can close this issue. You will just need to use the latest versions of both systems to get section names.

sanyabt commented 1 year ago

Awesome, thank you so much! I will look out for the commits.

bgyori commented 1 year ago

Done in #1399

sanyabt commented 1 year ago

Hi @bgyori, I wanted to ask if extracting section_type only works with the "process_nxml_file" function in the REACH API? I've been using "process_text" with a local server which seems to be much faster than "process_nxml_file". The local server gets overloaded, however, when I try to run "process_nxml_file" to get the statements with section_type.

bgyori commented 1 year ago

If you are reading plain text, then process_text is the way to go. However, for NXML-formatted content (which is the only input format that carries section information), you need to use one of the NXML-specific functions such as process_nxml_file. With a local web server, you would call it like this:

from indra.sources import reach
rp = reach.process_nxml_file("input.nxml", url=reach.local_nxml_url)

Not sure what happens in terms of the Reach server getting "overloaded" - do you mean it returns slowly or crashes?
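Once process_nxml_file returns a processor, the section information can be read back from each statement's evidence. The following is a minimal sketch, assuming that the REACH processor stores the value under the "section_type" key of Evidence.epistemics (the issue only confirms that a section_type field exists on the evidence, not where it lives). The helper name count_section_types is hypothetical and not part of INDRA.

```python
from collections import Counter


def count_section_types(statements):
    """Tally evidence objects by their section type.

    Assumption: each statement has an `evidence` list, and each evidence
    has an `epistemics` dict that may hold a "section_type" key.
    Evidence with no section information is counted under None.
    """
    counts = Counter()
    for stmt in statements:
        for ev in stmt.evidence:
            epistemics = ev.epistemics or {}
            counts[epistemics.get("section_type")] += 1
    return counts


# Possible usage with a local Reach server (not executed here):
# from indra.sources import reach
# rp = reach.process_nxml_file("input.nxml", url=reach.local_nxml_url)
# if rp is not None:
#     print(count_section_types(rp.statements))
```

If every count lands under None, the section information is probably not reaching INDRA at all, for example because the input was plain text or one of the two systems predates the fix discussed above.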

sanyabt commented 1 year ago

That's exactly how I was trying it! But after about 10 minutes I get the error - "ERROR: [2022-12-07 10:04:49] indra.sources.reach.api - Could not process NXML via REACH service.Status code: 503".

bgyori commented 1 year ago

I see, if you attach or email me the NXML file, I can try it out to see what I get.

sanyabt commented 1 year ago

Oh, I think it's 2 of my NXML files that are problematic and not the function or server! I tested a batch of NXML files just to be sure, and it was able to process all except those 2. Sorry about that and thank you for the quick responses!
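For anyone hitting the same 503 error, one way to find the problematic files is to process the batch one file at a time and record which inputs fail. This is only a sketch under stated assumptions: process_batch is a hypothetical helper, and it assumes the processing function either raises an exception or returns None when the Reach service rejects a file.

```python
def process_batch(paths, process_one):
    """Run `process_one` on each path and separate successes from failures.

    `process_one` is any callable taking a file path and returning a
    processor object, or None on failure (an assumption about how the
    REACH API signals a failed request).
    """
    results = {}
    failed = []
    for path in paths:
        try:
            rp = process_one(path)
        except Exception as exc:
            # Keep going so one bad file does not stop the whole batch
            failed.append((path, repr(exc)))
            continue
        if rp is None:
            failed.append((path, "no processor returned"))
        else:
            results[path] = rp
    return results, failed


# Possible usage with a local Reach server (not executed here):
# from indra.sources import reach
# run = lambda p: reach.process_nxml_file(p, url=reach.local_nxml_url)
# results, failed = process_batch(["a.nxml", "b.nxml"], run)
# print(failed)
```

The files listed in failed can then be inspected or sent to Reach individually, while the rest of the batch still yields statements.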