This was implemented for #15 and is working. But I'm having second thoughts about this feature:
Pro
Makes it easier to see which words got recognized by the OCR, and which got missed
Cons
The size of the .chaya files gets much larger (TODO: quantify), because of having to store a bounding box for each word
In-browser memory usage increases (TODO: quantify), for the same reason. This may make editing sluggish (or even cause the tab to crash) for large PDFs or on devices with less memory.
Code complexity increases: not just the code to highlight these areas, but also the interaction between pdf.js output being at the page level and this information being at the line level. E.g. when loading from file, we cannot simply add one chunk at a time, but need to accumulate chunks until a page is done.
It only shows which words were recognized in the initial OCR response, which is something that may not remain meaningful after manual editing
It's not clear how useful it is: how valuable is knowing that a word was "not recognized", when any other word could just as well be "incorrectly recognized"? (OCR systems probably differ in whether they will always output something, or will just leave out uncertain words)
A deficiency of the current implementation: The words unrecognized by OCR actually get dimmed (when in fact we should probably highlight them more)
Subjective: Looks ugly?
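
To make the code-complexity point concrete, here is a minimal sketch of the "accumulate chunks until a page is done" step. The `LineChunk` shape, the `PageAccumulator` class, and the assumption that chunks arrive ordered by page are all hypothetical, not the actual .chaya loading code:

```typescript
// Hypothetical shape of one line-level chunk; the real .chaya format may differ.
interface LineChunk {
  pageIndex: number; // page this line belongs to
  text: string; // OCR text for the line
}

type PageHandler = (pageIndex: number, lines: LineChunk[]) => void;

// Groups a stream of line-level chunks into pages and calls onPage once
// per completed page. Assumes chunks arrive ordered by pageIndex.
class PageAccumulator {
  private pending: LineChunk[] = [];
  private onPage: PageHandler;

  constructor(onPage: PageHandler) {
    this.onPage = onPage;
  }

  add(chunk: LineChunk): void {
    const first = this.pending[0];
    // A chunk from a different page means the pending page is complete.
    if (first !== undefined && first.pageIndex !== chunk.pageIndex) {
      this.flush();
    }
    this.pending.push(chunk);
  }

  // Must be called at end of input: the last page has no following
  // chunk to trigger its emission.
  flush(): void {
    const first = this.pending[0];
    if (first === undefined) return;
    this.onPage(first.pageIndex, this.pending);
    this.pending = [];
  }
}
```

The extra state and the easy-to-forget final `flush()` are the kind of thing that would not be needed if chunks could be added one at a time.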
Alternatives
Some alternative ways to show which words were recognized:
Show lines with tight fit, i.e. only display the part of each line that corresponds to the recognized words (This is dangerous as it may leave out precisely the words we want to catch)
Indent each line of the editable text by an indent corresponding to the first recognized word in the image. This is more likely to place recognized text directly under the corresponding image (at least at roughly similar zoom sizes)
That thing that scribeocr.com does, of superimposing the text (rendered in some font with matching font metrics) onto the image.
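
The indent alternative could look roughly like the sketch below. `WordBox` and `lineIndentPx` are made-up names, and it assumes left-to-right text with word positions given in image pixels, so "first recognized word" is taken to mean the leftmost one:

```typescript
// Hypothetical word box; x is the left edge in image pixel coordinates.
interface WordBox {
  text: string;
  x: number;
  width: number;
}

// Returns the left indent, in CSS pixels, for the editable text of one line,
// scaled from image coordinates to the width at which the image is displayed.
function lineIndentPx(
  words: WordBox[],
  imageWidthPx: number,
  displayedWidthPx: number,
): number {
  // No recognized words (or a degenerate image): no indent.
  if (words.length === 0 || imageWidthPx <= 0) return 0;
  const leftmost = Math.min(...words.map((w) => w.x));
  return leftmost * (displayedWidthPx / imageWidthPx);
}
```

Note that this needs only one number per line rather than a bounding box per word, which is part of the appeal if file size and memory turn out to be the main costs.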
A concrete next step may be to have another branch with this feature removed, and measure how much we save in file size, memory usage, and code complexity.
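
For the file-size part of that measurement, a rough first estimate may be possible even without a separate branch, by serializing the same data with and without the boxes. This is only a sketch: the `Word` and `Line` shapes are hypothetical, and it assumes the data is stored as JSON, which may not match the actual .chaya format. It says nothing about in-browser memory usage, which would need profiling.

```typescript
// Hypothetical data shapes; the real .chaya structure may differ.
interface Word {
  text: string;
  bbox?: [number, number, number, number]; // x, y, width, height
}
interface Line {
  words: Word[];
}

// Copy of the data with every per-word bounding box dropped.
function stripBoxes(lines: Line[]): Line[] {
  return lines.map((line) => ({
    words: line.words.map((w) => ({ text: w.text })),
  }));
}

// Size in bytes of the JSON serialization, encoded as UTF-8.
function utf8Size(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Compares serialized size with and without bounding boxes.
function bboxOverhead(lines: Line[]): {
  withBoxes: number;
  withoutBoxes: number;
  ratio: number;
} {
  const withBoxes = utf8Size(lines);
  const withoutBoxes = utf8Size(stripBoxes(lines));
  return { withBoxes, withoutBoxes, ratio: withBoxes / withoutBoxes };
}
```

Running `bboxOverhead` on the parsed contents of a few representative files would give a number to put in place of the first TODO above.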