harvard-lil / capstone

CAP database scripts.
MIT License

Reparse HTML with new attributes #2144

Closed kilbergr closed 8 months ago

kilbergr commented 1 year ago

The HTML that's currently in the casebody_cache table does not contain the paragraph or box tags from the PDFs. That means once we move over to the static file version of CAP, we'd lose that information. We'd like to keep it so that we can do things like rehydrate our HTML from the PDFs if OCR improves.

Currently, paragraph attributes are added when the HTML is exported to the casebody_cache table. This is the code that processes the blocks into paragraphs in the HTML (and the background Slack thread).

We're going to alter that code so that the HTML also carries attributes for the bounding boxes (data-bounding-boxes) and the paragraphs. Then we can either export that to casebody_cache and extract the data from there, or export directly to the files that will live on S3. (Conversation that led to this conclusion on Slack).
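As a rough illustration of the idea only, not the actual capstone export code: the sketch below writes a paragraph with a data-bounding-boxes attribute and then reads it back out of the HTML. The function names, the use of the paragraph `id` as the key, and the `[x, y, width, height]` box format are all assumptions made for the example; only the attribute name data-bounding-boxes comes from this issue.

```python
import json
from html import escape
from html.parser import HTMLParser


def render_paragraph(par_id, text, boxes):
    """Render one paragraph with its PDF boxes in a data attribute.

    `boxes` is assumed to be a list of [x, y, width, height] lists;
    the real block format in capstone may differ.
    """
    boxes_json = json.dumps(boxes, separators=(",", ":"))
    return '<p id="{}" data-bounding-boxes="{}">{}</p>'.format(
        escape(par_id, quote=True),
        escape(boxes_json, quote=True),  # JSON must be attribute-safe
        escape(text),
    )


class BoundingBoxExtractor(HTMLParser):
    """Collect data-bounding-boxes values keyed by element id."""

    def __init__(self):
        super().__init__()
        self.boxes = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        raw = attrs.get("data-bounding-boxes")
        if raw is not None:
            # HTMLParser has already unescaped the attribute value.
            self.boxes[attrs.get("id")] = json.loads(raw)


def extract_boxes(html):
    """Return {element id: boxes} for every tagged element in `html`."""
    parser = BoundingBoxExtractor()
    parser.feed(html)
    return parser.boxes


if __name__ == "__main__":
    html = render_paragraph("b17-4", "Opinion text & more", [[10, 20, 300, 40]])
    print(html)
    print(extract_boxes(html))
```

Storing the boxes as JSON in a single attribute keeps the round trip simple whether the HTML ends up in casebody_cache or in static files on S3; the same extraction step would work on either.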

AC: