keensoft / alfresco-simple-ocr

Simple OCR action for Alfresco

Disabling page auto-rotations #57

Open ajab21 opened 5 years ago

ajab21 commented 5 years ago

Is there a way to still run Alfresco Simple OCR (w/ pdfsandwich) on each new document version (so text can continue to be found on the pages) yet disable the auto-rotation portion of the process for versions of the document after 1.0? The business scenario here is to keep manual page rotations (i.e. corrections to improper automatic orientation) from being repeatedly overridden by the automatic processing. Our current idea is to write logic that checks the document version and decides whether to apply auto-rotation. In other words, apply automatic page rotations to the very first version 1.0, but don't do so on any subsequent version, when manual changes/corrections could have been made. Of course, this is dependent on whether we’re able to pass a command to Simple OCR and/or pdfsandwich to conditionally disable the auto-rotation portion of the process. Is this possible to do? If so, do you know the code or command we need to employ in order to achieve this?

Stepping back, just wondering if you’ve heard of this problem before and whether you know of any other approaches we may want to consider (instead of the idea described above) to overcome it.

Thank you!


Here's more background:

There are anomalies with some kinds of scanned documents being uploaded where the automation logic is not able to determine the page rotation correctly. Auto-rotation is based on what the process finds on the page and how it believes the text direction should flow. But there are times when pages have text flowing in conflicting directions (i.e. one block of text goes one way and another block goes a different way, not to mention times when the text is handwritten rather than computer-generated). So, when the auto-rotation ends up being incorrect for understandable reasons, the user manually rotates the page and saves the change before adding annotations (via another third-party tool). This results in a new document version in Alfresco, which in turn triggers Simple OCR / pdfsandwich to run once again against the new version. The automatic process then reverses the user’s manual correction and auto-rotates the page back to the incorrect orientation. The next time a user views the document, they see the incorrect rotation again, plus an annotation layer that no longer corresponds to the proper coordinates of the page. At this point, manually rotating the page in the UI document viewer results in the annotation being rotated incorrectly and often in an illegible manner. The problem is recursive in nature, and any annotations added (as they often will be) make the problem that much worse.

angelborroy-ks commented 5 years ago

Adding the -nopreproc option to the ocr.extra.commands parameter could solve your issue. Detailed information on pdfsandwich options is available at http://www.tobias-elze.de/pdfsandwich/

ajab21 commented 5 years ago

@angelborroy-ks thanks for your quick reply back.

Unfortunately, -nopreproc is not disabling auto-rotation as expected. We've reached out to Tobias for more insight about pdfsandwich options, so hope we can resolve with his assistance.

If it turns out we need to explore other options, do you have any recommendations on other tools to use for a searchable image layer? As background, we're using tesseract for base OCR at the document metadata level, so the gap we need to fill is allowing users to search for words and jump to pages within a document based on matches on top of the page image layer.

Many thanks in advance for your continued guidance!! We're about 1-2 weeks from first production launch of Alfresco, and this issue (plus another issue related to serious pixel loss after pdfsandwich runs) is causing showstopper concern for the project. So, we're scrambling for ideas on how to resolve.

angelborroy-ks commented 5 years ago

Did you try ocrmypdf? This software includes many different options to deal with Tesseract parameters.

ajab21 commented 5 years ago

No, I haven't, but I was just looking into it in fact. Based on your experience, would you say OCRmyPDF is a more sophisticated tool and may be better suited for our needs given the problem at hand? We saw you mentioned pdfsandwich first in the list in your write-up, so maybe we assumed incorrectly that's the one you had more preference for.

angelborroy-ks commented 5 years ago

Yes, I suggested OCRmyPDF for your use case because it's more customisable than pdfsandwich. We could say that pdfsandwich is a good basic tool (enough for many users) but OCRmyPDF is an expert tool (which requires more tuning and expertise).

Let me know how it goes.

ajab21 commented 5 years ago

Thanks! Will do.

ajab21 commented 5 years ago

FYI, just an update that OCRmyPDF is working out much better with the additional options. Thanks again!

DEEPAK-KESWANI commented 5 years ago

Hi,

Please see the attached image, which shows the output PDF getting distorted with each ocrmypdf run.

[Attached image: distorted_from_v1 0_to_v1 4]

FYI, we are using the auto-rotate options (--rotate-pages --rotate-pages-threshold 1) only for the 1st version; for the remaining versions of the PDF, we are not using the auto-rotate options.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --rotate-pages --rotate-pages-threshold 1 v_1.0.pdf v_1.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.1.pdf v_1.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.2.pdf v_1.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf v_1.3.pdf v_1.4.pdf

NOTE: OCRmyPDF version: 7.0.0

Could you please help me with this?

Also, if I add the --oversample 600 option to the command for each version, it works fine, but the output PDF size increases.

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 --rotate-pages --rotate-pages-threshold 1 v_2.0.pdf v_2.1.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.1.pdf v_2.2.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.2.pdf v_2.3.pdf

sudo ocrmypdf --verbose 1 --force-ocr -l eng --output-type pdf --oversample 600 v_2.3.pdf v_2.4.pdf

Thanks.
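The version-dependent invocation described in this thread (auto-rotate only when processing version 1.0, optionally oversampling every run) could be sketched roughly as follows. This is only an illustrative sketch, not code from alfresco-simple-ocr or OCRmyPDF: the function name `build_ocrmypdf_command` and the `version_label` and `oversample_dpi` parameters are hypothetical, and the flags are simply the ones quoted in the commands above.

```python
def build_ocrmypdf_command(version_label, src, dst, oversample_dpi=None):
    """Build an ocrmypdf argument list for one document version.

    Hypothetical helper: `version_label` is assumed to be the label of the
    input document version (e.g. "1.0"), as supplied by the caller.
    """
    # Base options copied from the commands quoted in this thread.
    cmd = ["ocrmypdf", "--verbose", "1", "--force-ocr",
           "-l", "eng", "--output-type", "pdf"]

    # Optional oversampling, which the thread reports avoids the distortion
    # at the cost of a larger output file.
    if oversample_dpi is not None:
        cmd += ["--oversample", str(oversample_dpi)]

    # Auto-rotate only the very first version, so that later manual
    # rotations are not overridden on subsequent runs.
    if version_label == "1.0":
        cmd += ["--rotate-pages", "--rotate-pages-threshold", "1"]

    cmd += [src, dst]
    return cmd


# Example: first version gets auto-rotation, later versions do not.
first = build_ocrmypdf_command("1.0", "v_1.0.pdf", "v_1.1.pdf")
later = build_ocrmypdf_command("1.1", "v_1.1.pdf", "v_1.2.pdf", oversample_dpi=600)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. How the version label is obtained from Alfresco is left out here, since the thread does not show it.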

angelborroy-ks commented 5 years ago

I'm not an OCR expert. You'll probably get better answers from the OCRmyPDF project.