This should be a no-brainer, but we need to deal with a few things:
We use the hOCR files to work out the page size: if a DPI is encoded in the hOCR file, we compute the size from it; otherwise we estimate it.
The code that generates the initial PDF with a text layer obviously relies on hOCR. When we have no hOCR, we could instead just make a PDF with empty pages of the right size.
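For reference, the page-size step could look roughly like the sketch below. It is not the project's actual code: the function name `page_size_points` and the 300 DPI fallback are assumptions made for illustration. It reads the `bbox` and optional `scan_res` properties from an hOCR `ocr_page` title string and converts pixels to PDF points (72 per inch).

```python
import re

# Assumed fallback resolution when the hOCR carries no scan_res property.
DEFAULT_DPI = 300


def page_size_points(title, default_dpi=DEFAULT_DPI):
    """Return (width_pt, height_pt) for an hOCR ocr_page title string.

    Hypothetical helper: uses scan_res if present, else default_dpi.
    """
    bbox = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
    if bbox is None:
        raise ValueError("ocr_page title has no bbox property")
    x0, y0, x1, y1 = map(int, bbox.groups())

    res = re.search(r"scan_res\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)", title)
    if res:
        xres, yres = float(res.group(1)), float(res.group(2))
    else:
        # No DPI encoded: fall back to an estimate.
        xres = yres = float(default_dpi)

    # Pixels -> inches -> PDF points (72 points per inch).
    return (x1 - x0) * 72.0 / xres, (y1 - y0) * 72.0 / yres


title = 'image "page_0001.png"; bbox 0 0 2550 3300; ppageno 0; scan_res 300 300'
width_pt, height_pt = page_size_points(title)
```

Without hOCR, the same pixel-to-point conversion would presumably have to start from the image dimensions instead of `bbox`, with the DPI always estimated.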