getContent from OCR'd file

ThatOtherGuy7 commented 6 years ago

Hi, I successfully integrated simple-ocr with Tesseract into Alfresco on Linux/Ubuntu, and everything works fine. I can OCR a document and find it via both the live search and the advanced search. Now I want to get the content of the PDF document, read some information from it, and add it to the properties. When I call "document.getContent()" on a file (with or without OCR applied), I only get the encoded PDF content. I thought that, since the file was OCR'd, there would be a second stream inside the PDF containing just plain text, but that's not the case. Is there a way to extract, or simply get, the OCR plain-text layer out of the PDF? If so, how can I do that?

I know I can transform the file to .TXT, extract the content I need, add it to the original PDF file, and then delete or store the .TXT file, but that's a lot of effort just to add a value to the properties.

angelborroy-ks commented 6 years ago

We are transforming to TXT; I don't know of any other way using the default Alfresco Java API. You could probably use PDFBox to inspect the content and extract the right layer.
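
Once the plain text is available (for example from the TXT transformation), pulling a single value out of it can be a small regular-expression step. The following is only a minimal sketch: the class name, the method name, and the "Invoice No" label are hypothetical and stand in for whatever field the document actually contains. Writing the value back to the node's properties is not shown.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds one labelled value in OCR'd plain text.
final class OcrFieldExtractor {

    // Assumed label format, e.g. "Invoice No: INV-2018-0042" or "Invoice No. 12345".
    // The \b after "No" keeps words such as "Notes" from matching.
    private static final Pattern INVOICE_NO = Pattern.compile(
            "Invoice\\s+No\\b\\.?\\s*:?\\s*([A-Za-z0-9-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first captured value, or an empty Optional if the label is absent.
    static Optional<String> extractInvoiceNumber(String plainText) {
        if (plainText == null) {
            return Optional.empty();
        }
        Matcher m = INVOICE_NO.matcher(plainText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

OCR output is often noisy, so a real pattern would likely need to tolerate misread characters and odd spacing.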

ThatOtherGuy7 commented 6 years ago

I see. Thanks for your quick response. I will give PDFBox a try as you suggested. I'm not sure how to do it with JavaScript (yet), but it can't be that difficult to figure out.

angelborroy-ks commented 6 years ago

PDFBox can only be used from Java.

ThatOtherGuy7 commented 6 years ago

Yeah, I just noticed. I remembered that I can get the plain text with pdfbox-app-x.y.z.jar, but I have no idea how to use it from a rule script.
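
One possible way to drive the pdfbox-app jar from Java is to start it as an external process. This is a rough sketch, assuming the PDFBox 2.x command-line form `ExtractText <inputfile> [output-text-file]`; other PDFBox versions may use a different syntax, so check the documentation for the jar in use. The class and method names are hypothetical, and the code would still need to be wrapped in something an Alfresco rule can call, which is not shown here.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the pdfbox-app command-line tool.
final class PdfBoxCli {

    // Builds: java -jar <pdfbox-app jar> ExtractText <input.pdf> <output.txt>
    // (assumed PDFBox 2.x syntax).
    static List<String> extractTextCommand(Path pdfboxAppJar, Path inputPdf, Path outputTxt) {
        return Arrays.asList(
                "java", "-jar", pdfboxAppJar.toString(),
                "ExtractText", inputPdf.toString(), outputTxt.toString());
    }

    // Runs the command, forwards its output to this process, and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A call such as `PdfBoxCli.run(PdfBoxCli.extractTextCommand(jar, pdf, txt))` would leave the extracted text in the output file, assuming `java` is on the PATH and the jar path is valid.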

keensoft / alfresco-simple-ocr

getContent from OCR'd file #46