Closed shreevatsa closed 5 months ago
Probably the ProseMirror (PM) model should be the "container", holding the PDF bounding boxes and also the corresponding text.
PM's "view" will take care of rendering the PDF page (bounding box) images, and the OCRed text.
So schema something like:
- `UnbrokenPage`, where `UnbrokenPage` is just a page number.
- `Rectangle` (or `Line` may be a better name?), where `Rectangle` = `<bbox, text>`.

Wondering what to do with individual words. In case of Google OCR I have some logic for splitting into lines, maybe standardize on words?
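A minimal sketch of how the schema above might be written down, as a plain object in the shape that the `Schema` constructor from `prosemirror-model` accepts. The lowercase node names, the nesting of rectangles inside a page, the `BBox` field names, and the attribute defaults are all assumptions for illustration, not something decided in this issue.

```typescript
// Bounding box of one rectangle on the PDF page.
// Field names, units, and origin are assumptions.
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Plain schema spec: doc > unbrokenPage > rectangle > text.
// Nesting rectangles inside the page node is an assumption.
const schemaSpec = {
  nodes: {
    doc: { content: "unbrokenPage+" },
    unbrokenPage: {
      content: "rectangle+",
      // "just a page number"
      attrs: { pageNumber: { default: 1 } },
    },
    rectangle: {
      // <bbox, text>: bbox lives in attrs, text is the node content
      content: "text*",
      attrs: { bbox: { default: null as BBox | null } },
    },
    text: {},
  },
};
```

Keeping the text as node content (rather than as an attribute) is what would let PM's view make the OCRed text editable, while the bbox stays as metadata on the node. Standardizing on words instead of lines would mostly mean changing what one `rectangle` stands for.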
As an initial step / getting feet wet, let's just turn the page into a PM-controlled thing, handling each page individually.
Can use:
as reference.
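As a rough sketch of that first step, the OCR output for a single page could be converted into ProseMirror's JSON document format, one document per page. The node type names (`unbrokenPage`, `rectangle`), the `OcrLine` shape, and the helper `pageToDocJSON` are hypothetical; only the general `{type, attrs, content, text}` JSON layout comes from ProseMirror.

```typescript
interface BBox { x: number; y: number; width: number; height: number; }

// One OCRed line: its bounding box and its text (hypothetical shape).
interface OcrLine { bbox: BBox; text: string; }

// Minimal type for ProseMirror's JSON node representation.
interface PMNodeJSON {
  type: string;
  attrs?: Record<string, unknown>;
  content?: PMNodeJSON[];
  text?: string;
}

// Build a one-page document: doc > unbrokenPage > rectangle* > text.
function pageToDocJSON(pageNumber: number, lines: OcrLine[]): PMNodeJSON {
  const rectangles = lines.map((line): PMNodeJSON => {
    const node: PMNodeJSON = { type: "rectangle", attrs: { bbox: line.bbox } };
    // ProseMirror does not allow empty text nodes, so an empty
    // line gets no text child at all.
    if (line.text.length > 0) {
      node.content = [{ type: "text", text: line.text }];
    }
    return node;
  });
  return {
    type: "doc",
    content: [
      { type: "unbrokenPage", attrs: { pageNumber }, content: rectangles },
    ],
  };
}
```

With a matching schema, the result could then be passed to `Node.fromJSON(schema, json)` from `prosemirror-model` to obtain the document for that page's editor state.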
This was basically a "Get started" issue; renamed and closing now. The ProseMirror schema / document model will evolve as work on this code continues; it's not something that can be declared "done".