nextcloud / files_fulltextsearch_tesseract

OCR your documents before index
GNU Affero General Public License v3.0

PDF Image Extraction does not auto-rotate landscape pages #12

Open andrewborell opened 5 years ago

andrewborell commented 5 years ago

This may be a problem with Tesseract itself, or something that could be handled by an option set when creating the OCR instance. To be honest, I'm not sure where the best place to address it is. I found that in PDF documents containing scanned images, if an image is not rotated so that its text reads upright (left to right, in the case of English), OCR does not work and so the document is not indexed. If I recall correctly, duplexing and auto-rotating pages properly is usually a function of any decent scanner software. I only tested a 90-degree clockwise rotation. If embedded images rotated 180 degrees (upside down) also fail, mirrored page scans would probably have the same problem. So far I have only been able to test by printing PDF images with the Microsoft PDF printer, which does not auto-rotate the images, and with Bullzip, which corrects the rotation on pages. I'm eager to test with my office scanner to see whether rotation is handled well in the Canon scanning process.

If for no other reason, I'm posting this to help educate others on what is acceptable input for OCR work. If it had been a sideways text object I suspect it would have worked, but for embedded images OCR simply does not work unless the text is displayed upright, reading LTR.

XueSheng-GIT commented 11 months ago

Just as a side note: the Tesseract option "-psm 0" (spelled "--psm 0" in Tesseract 4 and later) could be used to detect the orientation of a page. Based on that, the image could be rotated to the correct orientation, and OCR could then be run on it.
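
A rough sketch of the detection step, in Python purely for illustration (this is not how the app does it, and the function names are made up). It assumes a Tesseract 4+ binary on the PATH with the OSD trained data installed, and that the orientation report contains a "Rotate:" line; older versions may print a different report.

```python
import re
import subprocess


def parse_osd_rotation(osd_text: str) -> int:
    """Return the rotation in degrees suggested by a Tesseract OSD report.

    Assumes the report contains a line such as "Rotate: 90".
    Returns 0 when no such line is found.
    """
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0


def detect_rotation(image_path: str) -> int:
    """Run orientation and script detection only (page segmentation mode 0).

    Assumes the "tesseract" binary is on the PATH; raises
    CalledProcessError if Tesseract exits with an error.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_osd_rotation(result.stdout)
```

If detect_rotation() returns a non-zero value, the extracted page image would be rotated by that amount with whatever image tool is at hand before the normal OCR pass. Which direction a given tool treats as positive is worth checking on a known sideways page first.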