gsireesh / ht-max

Code for the HT-MAX project
Apache License 2.0

Hierarchical Reading Order #5

Closed gsireesh closed 7 months ago

gsireesh commented 8 months ago

PaperMage currently extracts section headings, but it does not extract the text that belongs to each section, even though it already produces sentences and paragraphs that could be associated with those headings.

Find a way to render a PDF in a natural, "hierarchical" reading order that allows us to annotate per-section metadata.

This could be done either with PaperMage plus heuristics, or with a completely separate tool such as watr-works or GROBID.
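
As a rough illustration of the "PaperMage + heuristics" option, here is a minimal sketch of one possible heuristic: sort the extracted blocks into reading order, then attach each paragraph to the nearest heading that precedes it. The `Block` and `Section` types and the `group_by_section` function are hypothetical stand-ins, not PaperMage's API, and the sort key assumes a single-column layout; multi-column papers would need a smarter ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for an extracted layout element."""
    kind: str   # "heading" or "paragraph"
    text: str
    page: int   # zero-based page index
    top: float  # vertical offset from the top of the page


@dataclass
class Section:
    heading: str
    paragraphs: list[str] = field(default_factory=list)


def group_by_section(blocks: list[Block]) -> list[Section]:
    """Attach each paragraph to the closest heading that precedes it."""
    # Assumption: single-column layout, so (page, top) approximates reading order.
    ordered = sorted(blocks, key=lambda b: (b.page, b.top))
    # Placeholder bucket for text that appears before the first heading.
    sections = [Section(heading="(front matter)")]
    for block in ordered:
        if block.kind == "heading":
            sections.append(Section(heading=block.text))
        else:
            sections[-1].paragraphs.append(block.text)
    # Drop the placeholder if nothing appeared before the first heading.
    return [s for s in sections if s.paragraphs or s.heading != "(front matter)"]
```

Each resulting `Section` could then carry the per-section metadata mentioned above. This version produces a flat list of sections; nesting subsections under their parents would need an extra signal, such as heading numbering or font size.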

Tasks: