Document is changed by pdfsandwich but not searchable

jacotec commented 7 years ago

Hi,

I just need to get OCR running before I can put Alfresco into production for private/non-profit use. However, even after a day of trial and error, I can't get past one issue.

The documents are OCR'ed when I put them into the folder with the rule: I see the convert.bin and tesseract threads running in top, and afterwards the document gets a new version number ("OCRd"). But I can't search it. When I download the new PDF and open it in Acrobat, I still can't search it. If I let Acrobat run its own OCR on it and re-upload it as a new version, Alfresco finds the content as well.
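To confirm the symptom without Acrobat, a quick check like the following may help. It is only a sketch and assumes that `pdftotext` (from poppler-utils) is installed; the function name `has_text_layer` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given PDF yields any extractable text.
# Assumes pdftotext (poppler-utils) is on the PATH.
has_text_layer() {
    # "-" sends the extracted text to stdout; whitespace and form feeds are stripped
    text=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}
```

For example, `has_text_layer downloaded.pdf && echo "searchable" || echo "no text layer"`, where `downloaded.pdf` stands for the version downloaded from Alfresco.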

Does anyone have an idea what's going on here? Any help is highly appreciated :-)

```
Checking for convert:
convert -version
Version: ImageMagick 6.9.1-10 Q16 x86_64 2015-08-12 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules
Delegates (built-in): freetype jng jpeg ltdl png tiff wmf

Checking for unpaper:
unpaper -version
6.1
Checking for tesseract:
tesseract -v
Checking for gs:
gs -v
GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc. All rights reserved.
Input file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391.pdf"
Output file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391_ocr.pdf"
Number of pages in inputfile: 1
More threads than pages. Using 1 threads instead.
Processing page 1.
identify -format "%w\n%h\n" "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391.pdf[0]"
convert -type Bilevel -density 300x300 "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391.pdf[0]" /tmp/pdfsandwich4e4c4b.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwich4e4c4b.pbm /tmp/pdfsandwich28b7bf_unpaper.pbm
Processing sheet #1: /tmp/pdfsandwich4e4c4b.pbm -> /tmp/pdfsandwich28b7bf_unpaper.pbm
tesseract /tmp/pdfsandwich28b7bf_unpaper.pbm /tmp/pdfsandwich81f2bb -l deu pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=595 -dDEVICEHEIGHTPOINTS=816 -dPDFFitPage -o /tmp/pdfsandwichb52248.pdf /tmp/pdfsandwich81f2bb.pdf
OCR done. Writing "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391_ocr.pdf"
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3313619068351155391_ocr.pdf" /tmp/pdfsandwichb52248.pdf

Done.

2017-05-26 19:28:55,904 INFO [es.keensoft.alfresco.ocr.OCRTransformWorker] [defaultAsyncAction1] STDERR: unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavformat-ffmpeg.so.56)
unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavcodec-ffmpeg.so.56)
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.3 : libwebp 0.4.4 : libopenjp2 2.1.0

unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavformat-ffmpeg.so.56)
unpaper: /opt/alfresco-community/common/lib/libz.so.1: no version information available (required by /usr/lib/x86_64-linux-gnu/libavcodec-ffmpeg.so.56)
[image2 @ 0x1018900] Encoder did not produce proper pts, making some up.
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
```

jacotec commented 7 years ago

OK ... after reading the FAQ three times and understanding what is meant I got it running by using ocrmypdf and putting the ocr.command into a shell script ;-)

Sorry for that ... maybe it'll help someone

aboyz commented 7 years ago

I'm running into the same issue. Can you tell me, step by step, how to fix it? What OCR command did you put into the shell script, and where?

angelborroy-ks commented 7 years ago

You have different options described on the FAQ page.
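As a rough illustration of the approach jacotec describes above (ocrmypdf called from a shell script), here is a minimal sketch. It is not taken from the FAQ. It assumes the script receives the input PDF as its first argument and the output PDF as its second, and it clears `LD_LIBRARY_PATH` because the STDERR lines in the log show system tools picking up Alfresco's bundled `libz.so.1`. The names `OCR_BIN`, `OCR_LANG`, and `run_ocr` are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper script; the argument order ($1 = input PDF, $2 = output PDF)
# is an assumption and may differ from how the addon actually calls its OCR command.

# Illustrative variables, not addon settings.
OCR_BIN="${OCR_BIN:-ocrmypdf}"
OCR_LANG="${OCR_LANG:-deu}"

run_ocr() {
    (
        # Drop Alfresco's library path inside a subshell so that system
        # binaries load the system libz instead of the bundled one.
        unset LD_LIBRARY_PATH
        # -l selects the Tesseract language pack (deu, as in the log above)
        "$OCR_BIN" -l "$OCR_LANG" "$1" "$2"
    )
}

# Run only when both paths are given, so the file can also be sourced.
if [ "$#" -ge 2 ]; then
    run_ocr "$1" "$2"
fi
```

Presumably the addon's `ocr.command` setting would then point at this script; the exact properties and any extra arguments should be checked against the FAQ.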

keensoft / alfresco-simple-ocr

Document is changed by pdfsandwich but not searchable #29