gsireesh / ht-max

Code for the HT-MAX project
Apache License 2.0

Integrate PDF Highlights Parser #16

Closed gsireesh closed 7 months ago

gsireesh commented 7 months ago

This PR integrates @kamurphy11's PDF highlight detection code as a PaperMage parser that adds a layer of highlights to the parsed document. The parser is incorporated into the MaterialsRecipe.

NOTE: this code will currently cause errors when parsing most PDF files. This is because PaperMage layers expect entities to be non-overlapping, and our PDFs violate that in two ways. First, symbols like "®" cause a highlight to overlap with the previous line, which produces overly wide spans for some annotations. Second, some annotations simply overlap one another.
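To illustrate the non-overlap constraint described above, here is a minimal sketch of one possible workaround: merging overlapping highlight spans before they are added as a layer. This is not what the PR does, and it does not use PaperMage's own span or entity classes. The `(start, end)` tuples with an exclusive end offset and the function name `merge_overlapping_spans` are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def merge_overlapping_spans(spans: List[Span]) -> List[Span]:
    """Collapse overlapping spans so that no two results overlap.

    Spans that merely touch (one ends exactly where the next begins)
    do not overlap and are kept separate.
    """
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    highlights = [(0, 10), (5, 15), (20, 25)]
    print(merge_overlapping_spans(highlights))
```

Merging loses the boundary between two distinct annotations, so it only suits cases where the overlap is an extraction artifact. Genuinely overlapping annotations might need a different treatment, such as separate layers.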

Currently, at least PDF 6 in the os.listdir list can be parsed correctly. That file is "Effects of build direction and heat treatment on creep properties of Ni-base superalloy built up by additive manufacturing.pdf".

This closes out #2 and #14.