Open pinballelectronica opened 1 year ago
OCRmyPDF does have --rotate-pages to use Tesseract OCR for orientation and script detection. However, on a poor scan or a page that is light on text, this does not work well. You may need to use --rotate-pages-threshold to rotate more often, since the default setting is fairly cautious (it wants fairly strong evidence that the page is misrotated).
It is much better to rotate pages before OCR.
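For anyone scripting this, here is a minimal sketch of how the two flags mentioned above could be combined in an automated run. It only assembles the command line; the function names, the file names, and the threshold value of 5.0 are placeholders chosen for illustration, not recommended settings.

```python
import shlex
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, threshold=None):
    """Return an argv list that runs OCRmyPDF with automatic page rotation."""
    cmd = ["ocrmypdf", "--rotate-pages"]
    if threshold is not None:
        # A lower threshold lets rotation trigger on weaker evidence.
        cmd += ["--rotate-pages-threshold", str(threshold)]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(input_pdf, output_pdf, threshold=None):
    """Run OCRmyPDF; assumes the ocrmypdf executable is on PATH."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, threshold), check=True)


if __name__ == "__main__":
    # Placeholder file names and threshold, for illustration only.
    print(shlex.join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", threshold=5.0)))
```

Keeping the command construction separate from the subprocess call makes it easy to log or dry-run the exact invocation while experimenting with different thresholds on a batch of scans.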
A few thoughts on the general problem:
What were you trying to do?
I am (intentionally) trying to find vast amounts of horrible PDFs to train a model to automatically process REALLY bad scans. I found this monstrosity, replete with everything you could hate in a scan, including wavy letters, off-angle scanning, diagonal repetition of scanlines, positional scanning problems, etc.
https://www2.census.gov/library/publications/1921/compendia/statab/43ed/1920-02.pdf
I have been wrenching on it for a little while to preprocess it into something clean(er). I'm not sure if there's a "best practice" for working on this other than going to D.C. and re-scanning it, which is quite possibly easier. I ran this through some AI cleanup to improve recognition prior to OCR, but it looks to have either made it worse or only marginally better.
How would I OCR pages that are rotated and leave them rotated? Manually rotate, OCR, and rotate back? I have another program that does it POORLY (it generates total garbage). Some charts in this document are in landscape mode, so the rotation is intentional. See attached.
Thanks for any hints or thoughts! I really need to figure out something 100% automatable.
Where are you installing from?
Linux package manager (apt, dnf, etc.)
What operating system are you working on?
Linux
Relevant log output