Support PDF generation/compression without hOCR files

internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.

GNU Affero General Public License v3.0

Open MerlijnWajer opened 3 years ago

MerlijnWajer commented 3 years ago

This should be a no-brainer, but we need to deal with a few things:

We use hOCR files to determine the page size: if the hOCR file encodes a DPI, we derive the size from it; otherwise we estimate it.
The code that generates the initial PDF with a text layer obviously relies on hOCR. When there is no hOCR, we could instead generate a PDF with empty pages of the right size.
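
To make the two points above concrete, here is a rough sketch in plain Python. It is not the archive-pdf-tools implementation: `page_size_points`, `blank_pdf`, and the 300 DPI fallback are hypothetical names and values chosen for illustration. The only fixed fact it relies on is that a PDF point is 1/72 inch, so a page of `width_px` pixels scanned at `dpi` is `width_px * 72 / dpi` points wide.

```python
def page_size_points(width_px, height_px, dpi=None, fallback_dpi=300):
    """Convert image pixel dimensions to PDF points (1 pt = 1/72 inch).

    `fallback_dpi` is an assumed default for when no DPI is known
    (for example, when there is no hOCR file); it is not a value
    taken from the project.
    """
    effective_dpi = dpi if dpi else fallback_dpi
    return (width_px * 72.0 / effective_dpi, height_px * 72.0 / effective_dpi)


def blank_pdf(page_sizes):
    """Build a minimal PDF whose pages are empty but correctly sized.

    `page_sizes` is a list of (width_pt, height_pt) tuples.
    """
    n = len(page_sizes)
    # Object 1 is the catalog, object 2 the page tree, pages start at 3.
    kids = " ".join(f"{3 + i} 0 R" for i in range(n))
    objects = [
        "<< /Type /Catalog /Pages 2 0 R >>",
        f"<< /Type /Pages /Kids [{kids}] /Count {n} >>",
    ]
    for width_pt, height_pt in page_sizes:
        objects.append(
            "<< /Type /Page /Parent 2 0 R /Resources << >> "
            f"/MediaBox [0 0 {width_pt:.2f} {height_pt:.2f}] >>"
        )

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object for the xref table
        out += f"{num} 0 obj\n{body}\nendobj\n".encode("ascii")

    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += (
        f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
        f"startxref\n{xref_pos}\n%%EOF\n"
    ).encode("ascii")
    return bytes(out)


if __name__ == "__main__":
    # A 2550x3300 px scan at 300 DPI is US Letter (612x792 pt).
    size = page_size_points(2550, 3300, dpi=300)
    with open("blank.pdf", "wb") as fh:
        fh.write(blank_pdf([size]))
```

In practice the empty pages would presumably be created with whatever PDF library the rest of the pipeline already uses rather than written by hand; the hand-rolled writer is only there to keep the sketch dependency-free.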