keensoft / alfresco-simple-ocr

Simple OCR action for Alfresco

Problem during ocr #25

Closed tap90 closed 7 years ago

tap90 commented 7 years ago

When I put some PDF files in the Alfresco folder configured with the rule (ocr-extraction), Alfresco creates a new version of the file without performing OCR correctly.

When this happens, it writes this to alfresco.log:

`Version: ImageMagick 6.9.1-10 Q16 x86_64 2015-08-12 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules
Delegates (built-in): freetype jng jpeg ltdl png tiff wmf

Checking for unpaper: unpaper -version
*** error: Unknown parameter '-version'. Try 'unpaper --help' for options.
Checking for tesseract: tesseract -v
Checking for gs: gs -v
GPL Ghostscript 8.64 (2009-02-03)
Copyright (C) 2009 Artifex Software, Inc. All rights reserved.
Input file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf"
Output file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf"
Number of pages in inputfile: 1
Processing page 1.
identify -format "%w\n%h\n" "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf[0]"
convert -type Bilevel -density 300x300 "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf[0]" /tmp/pdfsandwichf66a6b.pbm
unpaper --overwrite --no-grayfilter --layout none /tmp/pdfsandwichf66a6b.pbm /tmp/pdfsandwich5838df_unpaper.pbm
Processing sheet: /tmp/pdfsandwichf66a6b.pbm -> /tmp/pdfsandwich5838df_unpaper.pbm
tesseract /tmp/pdfsandwich5838df_unpaper.pbm /tmp/pdfsandwich0ca5f3 -l spa+eng+fra pdf
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=595 -dDEVICEHEIGHTPOINTS=842 -dPDFFitPage -o /tmp/pdfsandwich5264db.pdf /tmp/pdfsandwich0ca5f3.pdf
OCR done. Writing "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf"
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf" /tmp/pdfsandwich5264db.pdf

Done.

2017-01-30 15:01:17,837 INFO [es.keensoft.alfresco.ocr.OCRTransformWorker] [http-apr-8080-exec-9] STDERR: tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /usr/local/lib/liblept.so.4)
tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /lib64/libtiff.so.5)
tesseract 3.04.01
leptonica-1.72
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.3

tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /usr/local/lib/liblept.so.4)
tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /lib64/libtiff.so.5)
Tesseract Open Source OCR Engine v3.04.01 with Leptonica`

I have noticed this error:

STDERR: tesseract: /opt/alfresco-community/common/lib/libjpeg.so.62: no version information available (required by /usr/local/lib/liblept.so.4)

Could it be the cause of this problem? How can I fix it?

Thanks in advance

angelborroy-ks commented 7 years ago

Maybe you can try running the instruction from the command line to debug the problem, as the input file is still stored on your filesystem.

tap90 commented 7 years ago

Sorry for my ignorance. Which command should I try?

angelborroy-ks commented 7 years ago

pdfsandwich /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf -verbose -lang spa+eng+fra

tap90 commented 7 years ago

This is the output:

pdfsandwich version 0.1.4
Checking for convert: convert -version
Version: ImageMagick 6.7.8-9 2016-06-16 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

Checking for unpaper: unpaper -version
*** error: Unknown parameter '-version'. Try 'unpaper --help' for options.
Checking for tesseract: tesseract -v
tesseract 3.04.01
leptonica-1.72
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7

Checking for gs: gs -v
gs: symbol lookup error: /lib64/libgs.so.9: undefined symbol: cmsCreateContext
Input file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf"
Output file: "/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572_ocr.pdf"
gs: symbol lookup error: /lib64/libgs.so.9: undefined symbol: cmsCreateContext
Fatal error: exception Failure("Error: Could not determine number of pages of file /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_6425226260248108572.pdf")

I think the problem starts with this line, when pdfsandwich tries to run the gs -v command:

gs: symbol lookup error: /lib64/libgs.so.9: undefined symbol: cmsCreateContext

I tried running gs -v on its own and the output is the same.

angelborroy-ks commented 7 years ago

Ok, so you have to fix your local gs (GhostScript) installation before trying to run this Alfresco addon.
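
A `symbol lookup error` like the one reported above means the dynamic linker loaded a library that does not provide a symbol `libgs.so.9` expects. `cmsCreateContext` belongs to Little CMS 2, so one plausible cause (an assumption, not confirmed in this thread) is an outdated or shadowing `liblcms2`. The sketch below filters `ldd` output to show which file a library name actually resolves to. The helper name `resolved_from` is hypothetical, and the path `/lib64/libgs.so.9` is taken from the error message.

```shell
#!/bin/sh
# resolved_from PATTERN: read `ldd` output on stdin and print the path
# that each library whose name matches PATTERN resolves to.
resolved_from() {
    # ldd lines look like: "libfoo.so.1 => /path/libfoo.so.1 (0x...)"
    # or, for unresolved libraries: "libfoo.so.1 => not found"
    awk -v pat="$1" '$1 ~ pat && $2 == "=>" {
        print ($3 == "not" ? "not found" : $3)
    }'
}

# Only inspect the library if both ldd and the file from the error exist.
if command -v ldd >/dev/null 2>&1 && [ -e /lib64/libgs.so.9 ]; then
    ldd /lib64/libgs.so.9 | resolved_from lcms
fi
```

The same filter can be pointed at the `tesseract` binary with the pattern `jpeg` to see whether the `libjpeg.so.62` under `/opt/alfresco-community/common/lib` is being picked up instead of the system copy.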

tap90 commented 7 years ago

Yes, the problem is that Ghostscript is installed as a dependency of ImageMagick, so I don't understand why it doesn't work.

angelborroy-ks commented 7 years ago

Several different problems could be the cause; however, executing gs -v on its own should give you exactly the same error.

Maybe isolating the problem will help to solve it.
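
One way to isolate the problem is to run each external tool's version check on its own, outside pdfsandwich, and see which one fails. This is a minimal sketch; the helper name `check_tool` is hypothetical, and the commands are the ones pdfsandwich prints in its "Checking for ..." lines.

```shell
#!/bin/sh
# check_tool NAME CMD...: run CMD with output discarded and report
# whether it exited successfully.
check_tool() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$name: ok"
    else
        # $? here still holds the exit status of the command tested above
        echo "$name: FAILED (exit $?)"
    fi
}

check_tool convert   convert -version
check_tool tesseract tesseract -v
check_tool gs        gs -v
```

`unpaper` is left out on purpose: its complaint about `-version` also appears in the run that reached "OCR done.", so it is probably not the failing step.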

tap90 commented 7 years ago

I removed and reinstalled Ghostscript. Now the command gs -v works correctly, but when I execute pdfsandwich, Ghostscript generates a new error:

pdfsandwich version 0.1.4
Checking for convert: convert -version
Version: ImageMagick 6.7.8-9 2016-06-16 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

Checking for unpaper: unpaper -version
*** error: Unknown parameter '-version'. Try 'unpaper --help' for options.
Checking for tesseract: tesseract -v
tesseract 3.04.01
leptonica-1.72
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7

Checking for gs: gs -v
GPL Ghostscript 9.07 (2013-02-14)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
Input file: "/home/ocrserver/prova.pdf"
Output file: "/home/ocrserver/prova_ocr.pdf"
GPL Ghostscript 9.07: Unrecoverable error, exit code 1
Fatal error: exception Failure("Error: Could not determine number of pages of file /home/ocrserver/prova.pdf")

This is the line:

GPL Ghostscript 9.07: Unrecoverable error, exit code 1 Fatal error: exception Failure("Error: Could not determine number of pages of file /home/ocrserver/prova.pdf")

Now I'm trying to install the latest version of pdfsandwich. I'll let you know.

tap90 commented 7 years ago

With the latest pdfsandwich it works correctly. It uses pdfunite, so you need to install this new dependency, but it works.
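
Since a newer pdfsandwich pulls in `pdfunite` (part of the Poppler utilities), it can help to confirm that every external command is on the `PATH` before running the OCR action. The sketch below assumes the tool list mentioned in this thread; the helper name `missing_tools` is hypothetical.

```shell
#!/bin/sh
# missing_tools CMD...: print the name of each command not found on PATH.
missing_tools() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Tool list taken from the logs and comments in this thread.
missing=$(missing_tools pdfsandwich pdfunite gs tesseract convert unpaper)
if [ -n "$missing" ]; then
    echo "Missing commands:"
    echo "$missing"
else
    echo "All commands found."
fi
```

This only checks that the commands exist, not that they run correctly. Alfresco may also start with a different `PATH` than an interactive shell, so the check is most meaningful when run as the same user and environment as the Alfresco service.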

DEEPAK-KESWANI commented 6 years ago

Hi, I'm unable to run the command below directly from a terminal on Ubuntu to convert a multi-page .tif file to a .pdf file.

Can you please help with this?

$ /usr/bin/pdfsandwich -verbose -lang spa+eng+fra Sample_3_Multi_page.tif -o Sample_3_Multi_page.pdf
pdfsandwich version 0.1.4
Checking for convert: convert -version
Version: ImageMagick 6.8.9-9 Q16 x86_64 2018-07-10 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC Modules OpenMP
Delegates: bzlib cairo djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png rsvg tiff wmf x xml zlib

Checking for unpaper: unpaper -version
6.1
Checking for tesseract: tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

Checking for gs: gs -v
GPL Ghostscript 9.18 (2015-10-05)
Copyright (C) 2015 Artifex Software, Inc. All rights reserved.
Input file: "Sample_3_Multi_page.tif"
Output file: "Sample_3_Multi_page.pdf"
Fatal error: exception Failure("Error: Could not determine number of pages of file Sample_3_Multi_page.tif")

Thanks.
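
pdfsandwich appears to expect a PDF as input, which would explain why the page count cannot be determined for a `.tif` file. A possible workaround, offered as an untested sketch rather than a confirmed fix, is to convert the TIFF to an image-only PDF with ImageMagick's `convert` first and then run pdfsandwich on the result. The helper names `pdf_name` and `ocr_tiff` are hypothetical.

```shell
#!/bin/sh
# pdf_name FILE: print FILE with its final extension replaced by .pdf
pdf_name() {
    printf '%s\n' "${1%.*}.pdf"
}

# ocr_tiff FILE: convert a (multi-page) TIFF to a PDF, then OCR that PDF.
# Assumes convert and pdfsandwich are installed and on PATH.
ocr_tiff() {
    src=$1
    tmp_pdf=$(pdf_name "$src")           # intermediate image-only PDF
    out_pdf="${tmp_pdf%.pdf}_ocr.pdf"    # searchable output PDF
    convert "$src" "$tmp_pdf" &&
        pdfsandwich -verbose -lang spa+eng+fra "$tmp_pdf" -o "$out_pdf"
}
```

With these definitions loaded, `ocr_tiff Sample_3_Multi_page.tif` would write the intermediate file `Sample_3_Multi_page.pdf` and, if both steps succeed, `Sample_3_Multi_page_ocr.pdf`.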

Ikkache27 commented 5 years ago

Dear @angelborroy-ks, I'm using Ubuntu 16.04. I have copied the two JARs and installed pdfsandwich 0.1.4. Could you please help me with this:

pdfsandwich version 0.1.4
Checking for convert: convert -version
Version: ImageMagick 6.8.9-9 Q16 x86_64 2019-11-12 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC Modules OpenMP
Delegates: bzlib cairo djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png rsvg tiff wmf x xml zlib

Checking for unpaper: unpaper -version
6.1
Checking for tesseract: tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2

Checking for gs: gs -v
GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
Input file: "/opt/alfresco-community/tomcat/temp/Alfresco/alice.pdf"
Output file: "/opt/alfresco-community/tomcat/temp/Alfresco/alice_ocr.pdf"
Fatal error: exception Failure("Error: Could not determine number of pages of file /opt/alfresco-community/tomcat/temp/Alfresco/alice.pdf")