Closed ThatOtherGuy7 closed 6 years ago
We are transforming to TXT; I don't know of any other way using the default Alfresco Java API. You could probably use PDFBox to inspect the content and extract the right layer.
I see. Thanks for your quick response. I will give it a try with PDFBox like you said. I'm not sure how to do it with JavaScript (yet), but it can't be that difficult to figure out.
PDFBox can only be accessed from Java.
Yeah, I just noticed. I just remembered that I can get the plain text with pdfbox-app-x.y.z.jar, but I have no idea how to use that from a rule script.
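One possible way to reach the app jar from Java code is to run it as an external process and read back the text file it writes. The following is only a minimal sketch under stated assumptions: it assumes a PDFBox 2.x app jar, whose command-line tool is `ExtractText <input.pdf> <output.txt>` (PDFBox 3.x uses a different command syntax), and that `java` is on the PATH. The class name `PdfTextViaApp` and both method names are hypothetical, and wiring the result into an Alfresco rule or action is not shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: shells out to the PDFBox 2.x app jar to get plain text.
class PdfTextViaApp {

    // Builds the command line for the PDFBox 2.x "ExtractText" tool.
    static List<String> buildCommand(Path appJar, Path pdf, Path txt) {
        return List.of("java", "-jar", appJar.toString(),
                "ExtractText", pdf.toString(), txt.toString());
    }

    // Runs the tool, reads the generated text file, then deletes it.
    static String extractText(Path appJar, Path pdf)
            throws IOException, InterruptedException {
        Path txt = Files.createTempFile("pdf-text-", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(appJar, pdf, txt))
                    .inheritIO() // forward the tool's own messages to our console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("ExtractText exited with code " + exitCode);
            }
            // Assumes the tool wrote UTF-8, which readString expects.
            return Files.readString(txt);
        } finally {
            Files.deleteIfExists(txt);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfTextViaApp <pdfbox-app.jar> <file.pdf>");
            return;
        }
        System.out.println(extractText(Path.of(args[0]), Path.of(args[1])));
    }
}
```

The text only exists in a temporary file, so nothing extra is stored next to the original PDF. Starting a JVM per document is slow, so for anything beyond occasional use, calling the PDFBox library directly from Java is likely the better fit.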
Hi, I successfully implemented simple-ocr with Tesseract in Alfresco on Linux/Ubuntu and everything works fine. I can OCR a document and find it via the Live-Search and the advanced search. Now I want to get the content of the PDF document, read some information from it, and add it to the properties. When I call "document.getContent()" on a file (with or without OCR applied), I only get the encoded PDF content. I thought that, since the file had been OCR'd, there would be a second stream inside the PDF containing just plain text, but that's not the case. Is there a way to extract or simply get the OCR plain-text layer out of the PDF? If so, how can one do that?
I know I can transform the file into .TXT format, extract the content I need, add it to the original PDF file, and delete or store the .TXT file, but that's a lot of effort just to add a value to the properties.