This PR introduces a grobid-based parser to detect sections of the document, such that we can ask questions like "what are all of the materials mentioned in this paper's methods section?"
Implementation:
This parser reads section structure from grobid's output. In a grobid document, the <body> tag contains nested <div>s, each of which corresponds to a section. The parser collects the coordinates of each sentence inside a div and then consolidates them into column-based groups.
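To make the consolidation step concrete, here is a minimal sketch, not the code in this PR. It assumes sentences are <s> elements carrying grobid's coords attribute ("page,x,y,w,h" boxes separated by ";"), that section titles live in <head>, and that a column can be guessed from which half of the page a box's center falls in. The function names, the page_width default, and the half-page heuristic are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace used by grobid output


def parse_coords(coords):
    """Parse a coords string into (page, x, y, w, h) tuples."""
    boxes = []
    for part in coords.split(";"):
        if not part.strip():
            continue  # skip empty fragments (e.g. a missing attribute)
        page, x, y, w, h = part.split(",")[:5]
        boxes.append((int(page), float(x), float(y), float(w), float(h)))
    return boxes


def section_boxes(tei_xml, page_width=612.0):
    """Return one record per top-level <div>, with one merged box per
    (page, column). The column guess (left or right half of the page)
    is an assumption made for this sketch."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI}body")
    if body is None:
        return []
    sections = []
    for order, div in enumerate(body.findall(f"{TEI}div")):
        head = div.find(f"{TEI}head")
        name = head.text if head is not None and head.text else ""
        groups = defaultdict(list)
        for sentence in div.iter(f"{TEI}s"):
            for page, x, y, w, h in parse_coords(sentence.get("coords", "")):
                column = 0 if x + w / 2 < page_width / 2 else 1
                groups[(page, column)].append((x, y, w, h))
        merged = []
        for (page, column), boxes in sorted(groups.items()):
            # Union of all sentence boxes in this page/column group.
            x0 = min(b[0] for b in boxes)
            y0 = min(b[1] for b in boxes)
            x1 = max(b[0] + b[2] for b in boxes)
            y1 = max(b[1] + b[3] for b in boxes)
            merged.append((page, column, x0, y0, x1, y1))
        sections.append({"section_name": name, "order": order, "boxes": merged})
    return sections
```

A simple union like this also shows why a misplaced figure caption is costly: a single stray sentence box stretches the merged box for its whole column.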
Each section is then assigned an entity in the reading_order_sections layer. The section's name is stored under the section_name key in the entity Metadata, and its reading order under the order key. Each entity is defined by the boxes that make it up; here, each box is limited to one column on one page.
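As a rough illustration of how that metadata can be consumed, the sketch below uses plain dataclasses as stand-ins rather than papermage's own Entity, Box, and Metadata classes. Only the section_name and order keys come from this PR; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Stand-in for a box confined to one column on one page."""
    page: int
    left: float
    top: float
    width: float
    height: float


@dataclass
class SectionEntity:
    """Stand-in for an entity in the reading_order_sections layer."""
    boxes: list
    metadata: dict = field(default_factory=dict)


def in_reading_order(layer):
    """Sort section entities by the `order` metadata key."""
    return sorted(layer, key=lambda e: e.metadata["order"])


def find_sections(layer, name):
    """Return entities whose section_name contains `name` (case-insensitive)."""
    wanted = name.lower()
    return [
        e for e in in_reading_order(layer)
        if wanted in e.metadata.get("section_name", "").lower()
    ]
```

With a lookup like this, a question such as "what materials are mentioned in the methods section?" reduces to finding the matching entity and restricting the search to text that falls inside its boxes.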
Notes:
This parser does not extract the abstract from a paper - that's something that the VILA models used elsewhere in papermage seem to handle just fine, so we don't worry about it here.
This parser is limited by grobid's accuracy! In many cases, figure captions are treated as part of the columnar text, which causes the box aggregation to go haywire. There's not a lot we can do about this for the moment, so we're leaving it as is. In the future, we could add a postprocessing layer that joins boxes after the VILA prediction has been done, so that the figure/table caption boxes can be used to exclude some grobid content.
This PR should close #5.