ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Feature]: OCR on pages with multiple text rotations #1110

Open matthuszagh opened 1 year ago

matthuszagh commented 1 year ago

Describe the proposed feature

It would be useful to be able to perform OCR on pages with multiple text rotations and have the OCR apply to each rotated portion correctly. Is this currently possible, or might it become possible in the future? I expect this would require running tesseract once each for the 0°, 90°, 180°, and 270° rotations, using only the text portions for that rotation (is that possible?), and then stitching together the results.
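The "run once per rotation, then stitch" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not how OCRmyPDF or Tesseract work internally: it assumes each pass has already produced word boxes with a confidence score (for example parsed from Tesseract's TSV or hOCR output), that the page image was rotated clockwise by the given angle before each pass, and that keeping the highest-confidence word among overlapping boxes is an acceptable way to decide which rotation "owns" a region. The names `Word`, `to_page_coords`, `overlap_ratio`, and `stitch` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


class Word(NamedTuple):
    text: str
    conf: float  # recognition confidence, 0-100 (assumed scale)
    box: Box


def to_page_coords(box: Box, angle: int, page_w: float, page_h: float) -> Box:
    """Map a box found on the page image rotated `angle` degrees clockwise
    back into the coordinates of the unrotated page."""
    x0, y0, x1, y1 = box
    if angle == 0:
        pts = [(x0, y0), (x1, y1)]
    elif angle == 90:
        pts = [(y0, page_h - x0), (y1, page_h - x1)]
    elif angle == 180:
        pts = [(page_w - x0, page_h - y0), (page_w - x1, page_h - y1)]
    elif angle == 270:
        pts = [(page_w - y0, x0), (page_w - y1, x1)]
    else:
        raise ValueError("angle must be 0, 90, 180 or 270")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (iw * ih) / smaller if smaller > 0 else 0.0


def stitch(passes: Dict[int, List[Word]], page_w: float, page_h: float,
           min_conf: float = 60.0) -> List[Tuple[int, Word]]:
    """Merge per-rotation results into one list of (angle, word) pairs,
    with every box expressed in unrotated page coordinates."""
    candidates = []
    for angle, words in passes.items():
        for w in words:
            if w.conf >= min_conf:  # drop likely garbage from the wrong rotation
                box = to_page_coords(w.box, angle, page_w, page_h)
                candidates.append((angle, Word(w.text, w.conf, box)))
    # Highest confidence first, so the best reading of each region wins.
    candidates.sort(key=lambda c: c[1].conf, reverse=True)
    kept: List[Tuple[int, Word]] = []
    for angle, word in candidates:
        if all(overlap_ratio(word.box, k.box) < 0.5 for _, k in kept):
            kept.append((angle, word))
    return kept
```

The retained angle for each word is what a renderer would need in order to place the invisible text with the right orientation. The confidence threshold and the 0.5 overlap cutoff are arbitrary choices for the sketch.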

jbarlow83 commented 1 year ago

It turns out this is a limitation of Tesseract's PDF renderer rather than of Tesseract's OCR engine, which makes it tractable.

eshattow commented 11 months ago

Technical reference books commonly include inset data tabulations laid out as a rotated landscape page on Letter layout, i.e. the chapter title at the top and the page numbering at the bottom are in normal orientation, but the content of the page (like a wide spreadsheet) has been rotated 90° to better fit the page. If you use --rotate-pages, then sure, the content of that data tabulation gets rotated and OCR'ed, but the whole page is rotated in the output, which doesn't match the rest of the document, and the chapter title and page numbering are recognized as garbage text. The reverse is also true: if you don't rotate, the data table is recognized as garbage text, but the page stays oriented correctly relative to the rest of the book, and the chapter title and page number are recognized correctly.

How is this supposed to be handled in a PDF workflow?