This PR implements two alternative segmentation flavors:
article/light: Segments the document into header and body, and extracts only the title, authors, DOIs, and publication date (if available), leaving everything else in the body.
article/light-ref: Segments the document into header, body, and references, and extracts only the title, authors, DOIs, and publication date (if available), leaving everything else in the body.
The article's body is then composed of two paragraphs:
The first paragraph contains the leftover text from the header. Since the header extraction may be sparse, this avoids losing data.
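The difference between the two flavors can be summarized as a small lookup table. This is only an illustrative sketch: `FlavorSpec`, `FLAVORS`, `HEADER_FIELDS`, and `segments_for` are hypothetical names chosen for this example and are not taken from the PR's code. Only the flavor names, segment names, and extracted fields come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical field names: the header fields both flavors extract, per the description.
HEADER_FIELDS = ("title", "authors", "dois", "publication_date")


@dataclass(frozen=True)
class FlavorSpec:
    """Illustrative description of one segmentation flavor (not the PR's actual class)."""

    name: str
    segments: tuple[str, ...]
    extracted_fields: tuple[str, ...] = HEADER_FIELDS


# The two flavors differ only in whether references form their own segment.
FLAVORS = {
    "article/light": FlavorSpec("article/light", ("header", "body")),
    "article/light-ref": FlavorSpec(
        "article/light-ref", ("header", "body", "references")
    ),
}


def segments_for(flavor: str) -> tuple[str, ...]:
    """Return the segments a flavor produces, or raise ValueError if it is unknown."""
    try:
        return FLAVORS[flavor].segments
    except KeyError:
        raise ValueError(f"unknown flavor: {flavor!r}") from None
```

In this sketch, `segments_for("article/light-ref")` returns the header, body, and references segments, while both flavors share the same set of extracted header fields.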
coverage: 40.572% (-0.2%) from 40.739%
when pulling 2365fac8c0ec40875b2b2d9e4db779313e55131d on feature/segmentation-light
into 01745c06b1cb836bdc560a08427d9530effb4148 on flavor.
PR #1151 was tested as part of this PR.