keensoft / alfresco-simple-ocr

Simple OCR action for Alfresco

How to restrict OCR (PDFSandwich) for Searchable Documents (PDF)? #54

Open DEEPAK-KESWANI opened 5 years ago

DEEPAK-KESWANI commented 5 years ago

BUG: OCR (PDFSandwich) is also executed for searchable documents (PDF).

Expected behavior: OCR should not process documents that already contain text, i.e. searchable files.

Actual behavior: OCR is getting executed for Searchable Documents as well.

Steps to reproduce the behavior: Upload PDF files that already contain text. They are processed by OCR as well.

Please help me with this.

Tell us about your environment: Linux

angelborroy-ks commented 5 years ago

There is no way to be sure whether a PDF document is scanned or searchable. In the PDF format both are documents and both have text inside.

If you can provide any algorithm, technique or whatever to identify a scanned PDF document, we'll include this feature in the addon.

Manucciu commented 5 years ago

Hello, there is a simple JavaScript check to find out whether a PDF already contains OCR text or not 👍

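// Extract the PDF's text into a temporary folder
var transformedPdfFolder = space.createFolder("_temp_txtfolder");
var transformedPdfFile = document.transformDocument("text/plain", transformedPdfFolder);

// Any non-whitespace character means the PDF already has a text layer
if (transformedPdfFile.content.match(/\S/)) { /* text found: don't run OCR */ } else { /* no text: run OCR */ }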
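The check above can be sketched as a small reusable function. This is only a rough illustration, not part of the addon: `hasTextLayer` and its `minChars` threshold are hypothetical names introduced here, and the threshold is an assumption added so that a few stray characters in a scanned PDF need not count as a text layer.

```javascript
// Hypothetical helper, not part of alfresco-simple-ocr.
// Returns true when the extracted text holds at least `minChars`
// non-whitespace characters, i.e. the PDF probably has a text layer already.
function hasTextLayer(extractedText, minChars) {
    // Default threshold (an assumption): a single visible character is enough.
    var threshold = (minChars === undefined) ? 1 : minChars;
    if (extractedText === null || extractedText === undefined) {
        return false;
    }
    // Drop spaces, tabs and line breaks before counting.
    var visible = String(extractedText).replace(/\s/g, "");
    return visible.length >= threshold;
}

// Possible use inside an Alfresco JavaScript rule, where `space` and
// `document` are root objects supplied by Alfresco:
//   var tmp = space.createFolder("_temp_txtfolder");
//   var txt = document.transformDocument("text/plain", tmp);
//   var skipOcr = txt !== null && hasTextLayer(txt.content);
```

Counting non-whitespace characters rather than matching any character avoids treating a transformation result made up only of spaces as searchable text.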

I would like to do this on a folder. Currently my JavaScript moves a document that has no OCR text to another folder, runs the OCR there, and then moves the document back to the original folder.

It's not perfect. Let me know if you have a better solution.

Cheers