gsireesh / ht-max

Code for the HT-MAX project
Apache License 2.0

Grobid-based reading order parser #17

Closed gsireesh closed 7 months ago

gsireesh commented 7 months ago

This PR introduces a grobid-based parser to detect sections of the document, such that we can ask questions like "what are all of the materials mentioned in this paper's methods section?"

Implementation: This parser reads sections from the grobid output format. In the grobid document there is a <body> tag, under which are nested <div>s, each corresponding to a section. The parser collects the coordinates of each sentence inside a div and then consolidates them into column-based groups.
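To make the idea concrete, here is a minimal sketch of that flow, not the code in this PR. It assumes grobid was run with sentence segmentation and coordinates enabled, so each <s> element carries a coords attribute of the form "page,x,y,w,h;page,x,y,w,h". The function names, the fixed page_width, and the "left or right of the page midline" column rule are all hypothetical stand-ins for whatever consolidation the parser actually uses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = {"tei": TEI_NS}

# Tiny hand-written stand-in for a grobid TEI document (illustrative only).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Methods</head>
<p><s coords="1,50.0,100.0,200.0,10.0;1,50.0,112.0,180.0,10.0">A.</s>
<s coords="1,320.0,100.0,200.0,10.0">B.</s></p></div>
</body></text></TEI>"""


def parse_coords(coords):
    """Turn 'page,x,y,w,h;page,x,y,w,h' into a list of (page, x, y, w, h)."""
    boxes = []
    for chunk in coords.split(";"):
        if not chunk:
            continue
        page, x, y, w, h = chunk.split(",")
        boxes.append((int(page), float(x), float(y), float(w), float(h)))
    return boxes


def section_boxes(tei_xml, page_width=612.0):
    """Return [(section_name, [merged boxes])], one merged box per (page, column).

    Assumption: a box belongs to column 0 if its horizontal centre is left of
    the page midline, else column 1. This is a hypothetical heuristic.
    """
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.findall(".//tei:body/tei:div", TEI):
        head = div.find("tei:head", TEI)
        name = head.text if head is not None and head.text else ""
        groups = defaultdict(list)
        for sent in div.iter("{%s}s" % TEI_NS):
            for page, x, y, w, h in parse_coords(sent.get("coords", "")):
                column = 0 if x + w / 2 < page_width / 2 else 1
                groups[(page, column)].append((x, y, x + w, y + h))
        merged = []
        for (page, _column), rects in sorted(groups.items()):
            # Union of all sentence boxes in this page/column group.
            x0 = min(r[0] for r in rects)
            y0 = min(r[1] for r in rects)
            x1 = max(r[2] for r in rects)
            y1 = max(r[3] for r in rects)
            merged.append((page, x0, y0, x1 - x0, y1 - y0))
        sections.append((name, merged))
    return sections


result = section_boxes(SAMPLE)
```

In this sketch the two sentence boxes in the left column collapse into one box, while the right-column sentence stays separate, so a section that wraps across columns ends up with one box per column per page.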

Each section is then assigned an entity in the layer reading_order_sections. The section an entity belongs to is stored under the section_name key in the entity's Metadata, and its position in the reading order under the order key. Each entity is defined by the boxes that make it up; here, each box is limited to one column on one page.
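As a rough illustration of how such a layer could be consumed, the sketch below uses plain dataclasses rather than the project's real entity and box classes. Box, Entity, and build_reading_order_layer are hypothetical names, and numbering order from 0 is an assumption. Only the layer name and the section_name and order keys come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    # Hypothetical stand-in for the project's box type.
    page: int
    x: float
    y: float
    w: float
    h: float


@dataclass
class Entity:
    # Hypothetical stand-in for the project's entity type.
    boxes: list
    metadata: dict = field(default_factory=dict)


def build_reading_order_layer(sections):
    """Build one entity per section from [(section_name, [(page, x, y, w, h)])]."""
    layer = []
    for order, (name, boxes) in enumerate(sections):
        layer.append(
            Entity(
                boxes=[Box(*b) for b in boxes],
                metadata={"section_name": name, "order": order},
            )
        )
    return layer


# Made-up input: a Methods section that spans two pages.
sections = [
    ("Introduction", [(1, 50.0, 100.0, 200.0, 300.0)]),
    ("Methods", [(1, 320.0, 100.0, 200.0, 400.0), (2, 50.0, 80.0, 200.0, 150.0)]),
]
reading_order_sections = build_reading_order_layer(sections)

# Look up the entity for one section by its section_name metadata key.
methods = [
    e for e in reading_order_sections if e.metadata["section_name"] == "Methods"
]
```

A question like "what are all of the materials mentioned in the methods section?" would then reduce to selecting this entity and intersecting its boxes with whatever layer holds the material mentions.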

Notes:

This PR should close #5.