Complete train wreck of a PDF, trying to OCR rotated.

What were you trying to do?

I am (intentionally) trying to find vast amounts of horrible PDFs to train a model to automatically process REALLY bad scans. I found this monstrosity, replete with everything you could hate in a scan: wavy letters, off-angle scanning, diagonal repetition of scanlines, positional scanning problems, etc.

https://www2.census.gov/library/publications/1921/compendia/statab/43ed/1920-02.pdf

I have been wrenching on it for a little while to preprocess it into something clean(er). I'm not sure if there's a "best practice" for working on this other than going to D.C. and re-scanning it, which is quite possibly easier. I ran this through some AI cleanup to improve recognition prior to OCR, but it looks to have either made it worse or only marginally better.

How would I OCR pages that are rotated and leave them rotated? Manually rotate, OCR, and rotate back? I have another program that does it POORLY (it generates total garbage). Some charts in this document are in landscape mode, so the rotation is intentional. See attached.

Thanks for any hints or thoughts! I really need to figure out something 100% automatable.

000033_rasterize

Where are you installing from?

Linux package manager (apt, dnf, etc.)

What operating system are you working on?

Linux

Relevant log output

WSL2, Ubuntu 22.04

# wc -l dia.txt
29 dia.txt

29 pages with lots of diacritics

OCR: 100%|██████████████████████████████████████████████████████████████████████| 107.0/107.0 [00:24<00:00,  4.32page/s]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████████████████████████████████████████| 107/107 [00:08<00:00, 12.16page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: -0.3%
Image optimization did not improve the file - optimizations will not be used
Output file is a PDF/A-2B (as expected)

OCRmyPDF does have --rotate-pages, which uses Tesseract's orientation and script detection. However, this does not work well on a poor scan or on a page that is light on text. You may need to use --rotate-pages-threshold to rotate more often, since the default setting is fairly cautious (it wants fairly strong evidence that the page is misrotated).
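
As a rough sketch of the invocation described above, the Python snippet below builds an ocrmypdf command line with both options. The helper name, the file names, and the threshold value of 5.0 are placeholders chosen for illustration, not recommendations. A lower threshold makes rotation more likely, so the right value for a given batch of scans has to be found by experiment.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, threshold=None):
    """Return an argument list for ocrmypdf with automatic page rotation.

    build_ocrmypdf_command is a hypothetical helper for this example only.
    `threshold` is passed to --rotate-pages-threshold when given; leaving it
    as None keeps OCRmyPDF's own (cautious) default.
    """
    cmd = ["ocrmypdf", "--rotate-pages"]
    if threshold is not None:
        cmd += ["--rotate-pages-threshold", str(threshold)]
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    # 5.0 is an arbitrary illustrative value, not a tuned setting.
    print(shlex.join(build_ocrmypdf_command("input.pdf", "output.pdf", threshold=5.0)))
```

The resulting list can be handed to subprocess.run once the threshold has been checked against a few sample pages.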

It is much better to rotate pages before OCR.
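
For pages that are intentionally landscape and should stay that way in the final file, one way to automate "rotate, OCR, rotate back" is to record the two angles per page up front. The sketch below only does the bookkeeping and assumes the angles are already known (entered by hand or taken from some orientation detector). It handles multiples of 90 degrees only, not skew, and it does not touch the PDF itself; applying the rotations would need a PDF library, which is not shown. The function name and the sample page numbers are made up for the example.

```python
def rotation_plan(upright_rotation):
    """Map each rotated page to (rotate_before_ocr, rotate_after_ocr).

    `upright_rotation` maps a page number to the clockwise rotation, in
    degrees, that makes the text on that page upright. Pages that are
    already upright (0 degrees) are left out of the plan.
    """
    plan = {}
    for page, angle in upright_rotation.items():
        angle %= 360
        if angle % 90 != 0:
            raise ValueError(f"page {page}: {angle} is not a multiple of 90")
        if angle:
            # The second angle undoes the first, restoring the original layout.
            plan[page] = (angle, (360 - angle) % 360)
    return plan


if __name__ == "__main__":
    # Hypothetical input: page 33 needs 90 degrees clockwise to be readable.
    for page, (before, after) in rotation_plan({1: 0, 33: 90, 40: 270}).items():
        print(f"page {page}: rotate {before} before OCR, {after} after OCR")
```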

A few thoughts on the general problem:

While rescanning never appeals, the time and money spent solving hard problems in poor scans can easily exceed the time and money spent rescanning.
An AI model trained to visually enhance an image in a human's opinion is not necessarily going to improve OCR accuracy. An AI model trained for OCR has probably learned some of the same enhancement techniques in its intermediate layers.
