ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Issue in Orientation #279

Closed prabhu92m closed 6 years ago

prabhu92m commented 6 years ago

Hi Team,

I am facing an orientation issue (pages come out wrongly oriented) while running OCR on the attached PDF file. I changed the orientation threshold value to 0.002, but the page is still wrongly oriented with your latest package.

Kindly do the needful ASAP.

jbarlow83 commented 6 years ago

It seems that the PDF was not attached. I'm also not clear on whether you are using --rotate-pages and that is causing incorrect rotation, or rotation was changed unexpectedly. Please provide your command line.

I probably won't be able to address this for over a week. Please remember this is a volunteer-run open source project. However, since you seem to require a priority response, perhaps a commercial support contract would be of interest to you. If you wish to discuss that, please reach out to me: [EMAIL]. There are many ways I may be able to help you with your projects.

prabhu92m commented 6 years ago

The delay is not a problem, brother. I just want to find the root cause of this issue. Here is the attached file: Orientated pdf.pdf. And I am using --rotate-pages to fix the incorrect rotation.

prabhu92m commented 6 years ago

I think I have found the issue, which may be due to negative values when computing the correction variable in the orient_page method in the _pipeline.py script.

I fixed this by adding a simple if condition, shown below.

if pdfinfo[pageno].rotation > orient_conf.angle:
    correction = pdfinfo[pageno].rotation - orient_conf.angle
else:
    correction = (orient_conf.angle - pdfinfo[pageno].rotation) % 360

jbarlow83 commented 6 years ago

Thank you for this report. The fix was a little more involved, and your change probably does not cover all cases. I've also added more and better cases for rotation.

Despite this, the confidence is quite low on the file you submitted, and --rotate-pages (at least for me) will still misrotate some pages because Tesseract guesses the text orientation incorrectly. Therefore the wrong correction is applied. When Tesseract gets the orientation right, the final orientation is now correct.

Fixed in 6ef2651.
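
For readers following the arithmetic discussed above, here is a minimal sketch of why an if/else guard against negative values may not cover all cases. It is not the change made in 6ef2651, and it does not claim which sign convention is correct for OCRmyPDF, since the thread does not establish how the page rotation and the detected angle relate. The function names are hypothetical; only the if/else body is taken from the snippet posted earlier. The sketch compares that conditional with a single modular difference, relying on the fact that Python's % operator with a positive modulus never returns a negative result.

```python
def conditional_correction(rotation: int, angle: int) -> int:
    """The if/else form proposed earlier in this thread, wrapped in a function."""
    if rotation > angle:
        return rotation - angle
    return (angle - rotation) % 360


def modular_correction(rotation: int, angle: int) -> int:
    """Hypothetical alternative: one signed difference normalized into [0, 360).

    The direction (rotation - angle) is an assumption made for illustration.
    """
    return (rotation - angle) % 360


if __name__ == "__main__":
    # Compare both forms on pairs where the difference has opposite signs.
    for rotation, angle in [(90, 0), (0, 90), (270, 90), (90, 270)]:
        print(
            rotation,
            angle,
            conditional_correction(rotation, angle),
            modular_correction(rotation, angle),
        )
```

The conditional returns 90 for both (rotation=90, angle=0) and (rotation=0, angle=90), so it cannot distinguish a clockwise difference from a counterclockwise one. The modular form returns 90 and 270 for those two inputs, keeping the direction while still avoiding negative values.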