Open matthuszagh opened 1 year ago
It turns out this is a limitation of Tesseract's PDF renderer rather than of the Tesseract OCR engine itself, which makes it tractable.
Technical reference books commonly include data tables as rotated landscape content on a Letter-size page, i.e. the chapter title at the top and the page number at the bottom are in the normal orientation, but the content of the page (like a wide spreadsheet) has been rotated 90° to better fit the page. If you use --rotate-pages, the table content gets rotated and OCR'ed, but then the whole page is rotated in the output, which doesn't match the rest of the document, and the chapter title and page number are recognized as garbage text. The reverse is also true: if you don't rotate, the table is recognized as garbage, but the page stays oriented correctly relative to the rest of the book, and the chapter title and page number are recognized correctly.
How is this supposed to be handled in a PDF workflow?
Describe the proposed feature
It would be useful to be able to perform OCR on pages with multiple text rotations and have the OCR apply to each rotated portion correctly. Is this something that is currently possible, or that might become possible in the future? I expect this would require running Tesseract once each for 0°, 90°, 180°, and 270° rotation, using only the text portions for that rotation (is that possible?), and then stitching together the results.
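To make the "stitching together" step a bit more concrete, here is a rough sketch of the coordinate bookkeeping it would need. It is only an illustration under assumptions, not something OCRmyPDF or Tesseract does today: the OCR call itself is left out, and the `Box` type, the `to_page_coords` and `stitch` helpers, the `(text, confidence, box)` result format, and the confidence threshold are all hypothetical. It assumes each pass ran on a copy of the page rotated counterclockwise by the given angle, with pixel coordinates measured from the top-left corner.

```python
from typing import Dict, List, NamedTuple, Tuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def to_page_coords(box: Box, angle: int, page_w: int, page_h: int) -> Box:
    """Map a box found on a rotated copy of the page back to the unrotated page.

    `angle` is the counterclockwise rotation (0, 90, 180 or 270 degrees) that was
    applied to the page before OCR. `page_w` and `page_h` are the dimensions of
    the unrotated page. The origin is the top-left corner, y points down.
    """
    def point(x: int, y: int) -> Tuple[int, int]:
        if angle == 0:
            return x, y
        if angle == 90:
            return page_w - y, x
        if angle == 180:
            return page_w - x, page_h - y
        if angle == 270:
            return y, page_h - x
        raise ValueError(f"unsupported angle: {angle}")

    x0, y0 = point(box.left, box.top)
    x1, y1 = point(box.right, box.bottom)
    # Rotation can swap which corner is top-left, so re-normalize the box.
    return Box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def stitch(
    results_by_angle: Dict[int, List[Tuple[str, float, Box]]],
    page_w: int,
    page_h: int,
    min_conf: float = 60.0,  # assumed cutoff; garbage from the wrong rotation should score low
) -> List[Tuple[str, int, Box]]:
    """Merge per-rotation OCR words into one list in unrotated page coordinates."""
    merged = []
    for angle, words in results_by_angle.items():
        for text, conf, box in words:
            if conf >= min_conf:
                merged.append((text, angle, to_page_coords(box, angle, page_w, page_h)))
    # Rough reading order: top to bottom, then left to right.
    merged.sort(key=lambda item: (item[2].top, item[2].left))
    return merged
```

Filtering on confidence is just one guess at how to keep "only the text portions for that rotation"; it would not resolve cases where two passes both report confident text for the same region.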