internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

HOCR rendering compares unfavorably with tesseract PDF text layer #63

Open jrochkind opened 1 year ago

jrochkind commented 1 year ago

Using recode_pdf (internetarchivepdf 1.5.2) and tesseract (5.3.0).

I have three single-page examples, where I:

  1. have tesseract make a full PDF from OCR, via eg tesseract identifier.tiff identifier.tesseract -l eng pdf
  2. Have tesseract output HOCR, and feed the HOCR and TIFF to recode_pdf, via eg:
    • tesseract identifier.tiff identifier.tesseract -l eng hocr
    • recode_pdf --bg-downsample 3 --from-imagestack identifier.tiff --hocr-file identifier.tesseract.hocr -o identifier.recode_pdf.pdf
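
For repeatable comparisons, the two processes above could be scripted. This is only a sketch: it assumes tesseract and recode_pdf are on the PATH, the helper names (tesseract_cmd, recode_cmd, run_comparison) are hypothetical, and the flags are exactly the ones quoted above.

```python
import subprocess


def tesseract_cmd(identifier, output_format):
    """Build the tesseract call; output_format is 'pdf' or 'hocr'."""
    return ["tesseract", f"{identifier}.tiff", f"{identifier}.tesseract",
            "-l", "eng", output_format]


def recode_cmd(identifier):
    """Build the recode_pdf call that consumes the TIFF and the hOCR file."""
    return ["recode_pdf", "--bg-downsample", "3",
            "--from-imagestack", f"{identifier}.tiff",
            "--hocr-file", f"{identifier}.tesseract.hocr",
            "-o", f"{identifier}.recode_pdf.pdf"]


def run_comparison(identifier):
    """Run process 1 (tesseract PDF), then process 2 (hOCR + recode_pdf)."""
    for cmd in (tesseract_cmd(identifier, "pdf"),
                tesseract_cmd(identifier, "hocr"),
                recode_cmd(identifier)):
        subprocess.run(cmd, check=True)  # raises if a tool exits non-zero
```

Calling run_comparison("identifier") would leave identifier.tesseract.pdf and identifier.recode_pdf.pdf side by side for inspection.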

I am finding that the text layout produced by the second process, the one involving recode_pdf, is not identical to, and is inferior to, the text layout tesseract produces itself. I have put all my sample files on an S3 bucket for investigation, although I don't know if they will stay there forever.

Simple page

A simple, clear textual book page

If you select text in the PDF, the selection in the recode_pdf one is much shorter than in the tesseract one: the selection bar does not go all the way to the top of the ascenders like it does with the tesseract one.

In this case, the recode_pdf version is perfectly usable, but it demonstrates that the output is not identical.

Screen Shot 2023-03-28 at 2 14 31 PM Screen Shot 2023-03-28 at 2 14 40 PM

Somewhat more complicated page

This one is also a book page, but it has a figure in the middle of the page interrupting the text, some background coloration, and the photography was not perfectly squared, so the text is somewhat diagonal.

In this one, the recode_pdf-generated textual data all seems to have double height, making it very confusing to select text, and making highlights on search-within-the-PDF results also very confusing: a definite usability issue. I only have one line of text selected in these screenshots.

Screen Shot 2023-03-28 at 2 19 35 PM Screen Shot 2023-03-28 at 2 19 15 PM

More complex graphical page

This is a graphical advertisement that only has a little bit of text on it, at various places and in various fonts.

This one is harder to explain/demonstrate, and for me the problem only reproduces in MacOS Preview.

If I open the recode_pdf PDF in MacOS Preview (with "Live Text" disabled, yup), and try to drag to select the line at "effective residual deposit", I can't select the whole line -- the layout of the text is leading the PDF reader to think there is a column there or something.

This one does not reproduce in the Chrome PDF viewer; selection works okay there. But it reproduces in MacOS Preview, and these screen recordings are from there. (I disable "Live Text" in my MacOS settings to ensure that what I'm seeing in Preview is embedded text data from the PDF only, not OCR that MacOS Preview does itself on-the-fly under the branding "Live Text"!) I realize text order in PDFs is a heuristic applied by the viewer, but this demonstrates that something in the layout was different -- the layout from tesseract led to a successful heuristic in MacOS Preview, and the one from here did not.

Screen Recording 2023-03-28 at 2 31 42 PM Screen Recording 2023-03-28 at 2 31 08 PM

What's going on?

I know you originally ported the HOCR rendering from tesseract. Brainstorming....

MerlijnWajer commented 1 year ago

Thanks for the report, it will take me a bit of time to figure out what is up here.

One thing that comes to mind is that it is possible both with Tesseract and archive-pdf-tools to generate a PDF that doesn't have images in it, that might make comparing them easier. (And you can decompress some of the compressed text layers potentially to make it easier to diff them)
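
A rough sketch of what decompressing and diffing the text layers could look like, using only the Python standard library. It assumes the content streams are plain Flate-compressed (no filter chains, no encryption), it ignores /Length and object streams, and the function names are hypothetical; a real PDF library would be more robust.

```python
import difflib
import re
import zlib

# Raw bytes between "stream" and "endstream"; the lookbehind avoids
# matching the tail of the word "endstream" itself.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decoded text of every stream that zlib can inflate."""
    texts = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example an image stream)
        if data:
            texts.append(data.decode("latin-1"))
    return texts


def diff_text_layers(pdf_a, pdf_b):
    """Unified diff of the inflated streams of two PDFs, as a list of lines."""
    lines_a = "\n".join(inflate_streams(pdf_a)).splitlines()
    lines_b = "\n".join(inflate_streams(pdf_b)).splitlines()
    return list(difflib.unified_diff(lines_a, lines_b, "a", "b", lineterm=""))
```

Feeding it the bytes of two image-free PDFs of the same page would keep the diff limited to text operators rather than image data.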

I just checked the Tesseract git history of src/api/pdfrenderer.cpp, and I don't see any significant changes since I did the port. I think a bug in the port is the most likely problem here; it is also possible that some code that I added to take DPI into account messes with the quality. (For example, perhaps, if you upload this image to archive.org, it works fine, but there is a problem with the path that is not battle tested). So of the questions you list, I think the first one is the most likely one, and potentially the third/last one is also relevant.

jrochkind commented 1 year ago

Thanks for the reply! All my source materials are linked to from this ticket, if anyone wants to try to experiment with them further, such as generating text-only PDFs and then diffing text layers etc, or uploading to internet archive. It would take me a while to learn to use tools enough to do things like try to decompress and diff text layers, so it may take me a bit of time to get to that (or not) too!

"For example, perhaps, if you upload this image to archive.org, it works fine"

Hm, now this makes me curious -- is running recode_pdf not the same thing that archive.org does when you upload? I had naively assumed it would be the same. (Although of course there is always the possibility of different versions of tesseract involved etc).

MerlijnWajer commented 1 year ago

There are a few differences, mainly relating to how DPI is provided and handled. I don't have a Mac, so I would not be able to test the MacOS Preview-specific things, but I can test with evince, Firefox's pdf.js and mupdf.

jrochkind commented 1 year ago

Cool, yeah, I'd guess the underlying thing is just trying to figure out why the text layer differs at all, and if it can be made the same or more the same -- clearly the differences are leading MacOS Preview's heuristics to make different determinations of what constitutes a column of text, but even without access to MacOS Preview, the curious thing would be why/how the text element positioning differs in the first place!

jrochkind commented 1 year ago

Oh and I will say, in none of my tests did I supply a "dpi" argument to any tool. I don't know if tesseract and recode_pdf will extract the dpi from the TIFF source, or what. I had enough variables going on in my testing that I decided to just ignore providing dpi in an argument and let the tools do what they will. It would be interesting if that would make a difference, if supplying a dpi to recode_pdf would result in different/better text positioning!

If you think that is plausible, I could try it? Just supply the known dpi of the original TIFF? To both tesseract and recode_pdf I guess?

MerlijnWajer commented 1 year ago

In the case of the simple image, it already has a DPI embedded, so supplying --dpi 400 in this case won't make a difference. I haven't yet tested on the other files, but will try to take a look later this week. I also see a difference in evince between Tesseract's text and archive-pdf-tools' text.

MerlijnWajer commented 1 year ago

FYI: Tesseract also has a text only pdf option, pass -c textonly_pdf=1 to get a PDF without the images. I'll try to use that this week to debug/compare to the archive-pdf-tools output.

jrochkind commented 1 year ago

Awesome, thanks! All three source TIFFs should have dpi metadata embedded. The second image is 600dpi, the first and third are 400dpi.
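
One way to confirm the embedded resolution without relying on either tool is to read the TIFF tags directly. The sketch below is an assumption-laden illustration rather than anything from tesseract or recode_pdf: it handles only classic (non-BigTIFF) files, looks only at the first IFD, and the function name tiff_dpi is hypothetical. The tag numbers (282 XResolution, 283 YResolution, 296 ResolutionUnit) come from the TIFF 6.0 specification.

```python
import struct


def tiff_dpi(data):
    """Return (x_dpi, y_dpi) from the first IFD of a classic TIFF, or None."""
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None or struct.unpack(byte_order + "H", data[2:4])[0] != 42:
        raise ValueError("not a classic TIFF file")
    (ifd_offset,) = struct.unpack(byte_order + "I", data[4:8])
    (entry_count,) = struct.unpack(byte_order + "H", data[ifd_offset:ifd_offset + 2])

    resolution = {}
    unit = 2  # ResolutionUnit defaults to 2 (inch) when the tag is absent
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        entry = data[start:start + 12]
        tag, field_type, _count, value = struct.unpack(byte_order + "HHII", entry)
        if tag in (282, 283) and field_type == 5:
            # RATIONAL: the value field is an offset to numerator/denominator
            numerator, denominator = struct.unpack(
                byte_order + "II", data[value:value + 8])
            if denominator:
                resolution[tag] = numerator / denominator
        elif tag == 296:
            # SHORT: stored left-justified in the 4-byte value field
            (unit,) = struct.unpack(byte_order + "H", entry[8:10])

    if unit not in (2, 3) or 282 not in resolution or 283 not in resolution:
        return None  # no absolute resolution recorded
    scale = 2.54 if unit == 3 else 1.0  # unit 3 means pixels per centimeter
    return (resolution[282] * scale, resolution[283] * scale)
```

Usage would be along the lines of tiff_dpi(open("identifier.tiff", "rb").read()).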

jrochkind commented 1 year ago

(Also, while it's a different issue, the third example, "More complex graphical page", has pretty significant visible artifacts as a result of the MRC compression applied by recode_pdf. The other two do not.)

jrochkind commented 1 year ago

Note: img2pdf that we use to embed a jp2 into a single-page PDF -- can only work on jp2's without an "alpha" transparency layer/channel.

I think our TIFFs should not have an alpha channel. And then the jp2's we make from them should not have an alpha layer. And it should not be a problem.

But it has been a problem with some of our test data -- for instance, for whatever reason MacOS Preview seems to add an alpha channel to everything it saves.

If you try to use this on something with an alpha channel, you'll get an error from img2pdf: a TTY::Command::ExitError with the message "This function must not be called on images with alpha".

Maybe I'll wrap that in a better error.

Note: This is slow at present. MUCH slower than previous PDF generation. We may look at speeding it up. We may look at caching individual PDF pages.

artunit commented 1 year ago

It looks like pdfrenderer.py calculates a font size that is too large in some cases. There is some logic to deal with line slope and other factors that are beyond me, but I think that most calculations fall back on the default font size, and this is where the rendering is consistent with Tesseract (I wonder if Tesseract also falls back on a default font size in most cases). I am using a simple pdfrenderer.patch for our major papers project. It will still lean toward missized fonts without specifying DPI, and our pages tend to be produced with:

recode_pdf --dpi 300 --from-imagestack mpaper.jpg --hocr-file mpaper.hocr -o mpaper.pdf

This is a pure hack on my part. I notice that the Prima folks took the approach of mapping font metrics by word instead of line, but PDF text mapping seems tricky with either approach, and I think the line mapping is more consistent with other PDFs. I love the compact PDFs that recode_pdf produces.
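
A rough way to check whether one PDF really uses larger font sizes than another is to tally the Tf (set font and size) operators in a decoded content stream. The sketch below assumes the content stream has already been decompressed into a string; the function name is hypothetical, and the regular expression is a simplification of PDF syntax, not a parser.

```python
import re
from collections import Counter

# "/FontName <size> Tf" sets the current font and font size in a PDF
# content stream; the size may be an integer or a real such as 16.5 or .5.
TF_RE = re.compile(r"/(\S+)\s+(-?(?:\d+\.?\d*|\.\d+))\s+Tf\b")


def font_size_histogram(content):
    """Count how often each Tf font size occurs in a decoded content stream."""
    return Counter(float(size) for _name, size in TF_RE.findall(content))
```

Comparing the histogram of a Tesseract-generated page with that of the recode_pdf page would show whether the sizes differ. The rendered text height also depends on the text matrix (Tm) and any scaling operators, so matching Tf values alone would not prove the layouts are equivalent.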