sorgerlab / indra

INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system interfacing with NLP systems and databases to collect knowledge, and through a process of assembly, produce causal graphs and dynamical models.
http://indra.bio
BSD 2-Clause "Simplified" License

Propagate section information from new Reach implementation #1399

Closed by bgyori 1 year ago

bgyori commented 1 year ago

This PR adapts INDRA to recent changes in Reach for extracting section information when reading NXML files. An older implementation of this existed, but Reach stopped producing section names at some point, and the reinstated implementation works differently, so the code on the INDRA side had to be adapted as well. I also collected empirical statistics on the kinds of (unnormalized) section names that occur and improved their normalization.
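To illustrate what normalizing section names can look like, here is a minimal sketch. It is not the code in this PR: the `normalize_section` function, the keyword table, and the canonical labels are all hypothetical and only show the general idea of mapping free-form headings onto a small fixed vocabulary.

```python
import re

# Hypothetical keyword -> canonical label table (not INDRA's actual mapping).
# Order matters: the first keyword found in the heading wins.
SECTION_KEYWORDS = [
    ("abstract", "abstract"),
    ("introduction", "introduction"),
    ("background", "introduction"),
    ("method", "methods"),
    ("result", "results"),
    ("discussion", "discussion"),
    ("conclusion", "discussion"),
]

# Leading numbering such as "2. " or "3.1 " in front of a heading.
LEADING_NUMBER = re.compile(r"^\d+(?:\.\d+)*\.?\s*")


def normalize_section(raw_name):
    """Map a raw section heading to a canonical label, or None if unknown."""
    if not raw_name:
        return None
    text = LEADING_NUMBER.sub("", raw_name.strip().lower())
    for keyword, label in SECTION_KEYWORDS:
        if keyword in text:
            return label
    return None


if __name__ == "__main__":
    for heading in ["2. Materials and Methods", "RESULTS", "Acknowledgements"]:
        print(repr(heading), "->", normalize_section(heading))
```

A keyword table like this is easy to extend as new heading variants show up in the statistics, at the cost of some ambiguity for combined headings such as "Results and Discussion", which here resolve to whichever keyword is listed first.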

Independently, it looks like PubMed changed its search API to return a maximum of 10k IDs per search instead of 100k, which required updating tests. I also improved the way we get MeSH IDs from non-standard MeSH URNs produced by MedScan.
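For context on the search limit, the sketch below builds an E-utilities `esearch` request with `retmax` clamped to the 10k cap described above. The helper `build_search_url` and the constant `PUBMED_MAX_IDS` are hypothetical names for illustration and are not part of INDRA's PubMed client; the sketch only constructs the URL and makes no network call.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumption: the 10k cap on returned IDs described in this PR.
PUBMED_MAX_IDS = 10_000


def build_search_url(term, retmax=PUBMED_MAX_IDS):
    """Return an esearch URL for PubMed with retmax clamped to the cap."""
    capped = max(0, min(retmax, PUBMED_MAX_IDS))
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": capped,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # A request for 100k IDs is reduced to the 10k cap.
    print(build_search_url("BRAF", retmax=100_000))
```

Tests that previously asserted on result counts above 10k would need their expectations lowered accordingly.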