keensoft / alfresco-simple-ocr

Simple OCR action for Alfresco

libtiff.so.5 & liblept.so.5 problems on latest Alfresco 5.2 and Ubuntu 16.04 #47

Closed billydekid closed 6 years ago

billydekid commented 6 years ago

Hi, I have an issue after installing the latest simple-ocr 2.3.1 with the pdfsandwich OCR engine on Alfresco 5.2.0 and Ubuntu 16.04 LTS. All supporting applications were installed with apt-get / dpkg, as follows:

  1. TESSERACT:

    # apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-ind
    # tesseract -v
    tesseract 3.04.01
    leptonica-1.73
    libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
  2. PDFSANDWICH:

    # dpkg -i /media/sf_Downloads/Alfresco/Addons/simple-ocr/apps-supported/pdfsandwich_0.1.6_amd64.deb
    # apt-get -fy install
    # pdfsandwich -version
    pdfsandwich version 0.1.6

    Error when trying to OCR a document by clicking the OCR button on the document page:

    Exception in thread "defaultAsyncAction5" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 02290026 Failed to perform OCR transformation:
    Execution result:
    os:         Linux
    command:    /opt/alfresco-community/bin/bw-pdfsandwich.sh -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166_ocr.pdf
    succeeded:  false
    exit code:  2
    out:        pdfsandwich version 0.1.6
    Checking for convert:
    convert -version
    Version: ImageMagick 7.0.5-2 Q16 x86_64 2017-04-04 http://www.imagemagick.org
    Copyright: © 1999-2017 ImageMagick Studio LLC
    License: http://www.imagemagick.org/script/license.php
    Featur
    err:        tesseract: /opt/alfresco-community/common/lib/libtiff.so.5: no version information available (required by /usr/lib/liblept.so.5)
    tesseract 3.04.01
    leptonica-1.73
    libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.56 : libtiff 4.0.7 : zli
        at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
        at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
        at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
        at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
        at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464)
        at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
        at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
        at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 02290026 Failed to perform OCR transformation:
    Execution result:
    os:         Linux
    command:    /opt/alfresco-community/bin/bw-pdfsandwich.sh -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166_ocr.pdf
    succeeded:  false
    exit code:  2
    out:        pdfsandwich version 0.1.6
    Checking for convert:
    convert -version
    Version: ImageMagick 7.0.5-2 Q16 x86_64 2017-04-04 http://www.imagemagick.org
    Copyright: © 1999-2017 ImageMagick Studio LLC
    License: http://www.imagemagick.org/script/license.php
    Featur
    err:        tesseract: /opt/alfresco-community/common/lib/libtiff.so.5: no version information available (required by /usr/lib/liblept.so.5)
    tesseract 3.04.01
    leptonica-1.73
    libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.56 : libtiff 4.0.7 : zli
        at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
        at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
        ... 10 more
    Caused by: org.alfresco.service.cmr.repository.ContentIOException: 02290026 Failed to perform OCR transformation:
    Execution result:
    os:         Linux
    command:    /opt/alfresco-community/bin/bw-pdfsandwich.sh -verbose -lang eng+ind -rgb /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166.pdf -o /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_2287725343312660166_ocr.pdf
    succeeded:  false
    exit code:  2
    out:        pdfsandwich version 0.1.6
    Checking for convert:
    convert -version
    Version: ImageMagick 7.0.5-2 Q16 x86_64 2017-04-04 http://www.imagemagick.org
    Copyright: © 1999-2017 ImageMagick Studio LLC
    License: http://www.imagemagick.org/script/license.php
    Featur
    err:        tesseract: /opt/alfresco-community/common/lib/libtiff.so.5: no version information available (required by /usr/lib/liblept.so.5)
    tesseract 3.04.01
    leptonica-1.73
    libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.56 : libtiff 4.0.7 : zli
        at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
        ... 11 more

    These are the libtiff files I found on the server:

root@alflab:/usr# ls -l /usr/lib/x86_64-linux-gnu/libtiff.*
lrwxrwxrwx 1 root root     16 Mar 20 23:42 /usr/lib/x86_64-linux-gnu/libtiff.so.5 -> libtiff.so.5.2.4
-rw-r--r-- 1 root root 475496 Mar 20 23:42 /usr/lib/x86_64-linux-gnu/libtiff.so.5.2.4
root@alflab:/usr# ls -l /opt/alfresco-community/common/lib/libtiff.*
-rw-r--r-- 1 root root 781854 Jun 16  2017 /opt/alfresco-community/common/lib/libtiff.a
-rwxr-xr-x 1 root root   1099 Jun 16  2017 /opt/alfresco-community/common/lib/libtiff.la
lrwxrwxrwx 1 root root     16 Mar 24 22:54 /opt/alfresco-community/common/lib/libtiff.so -> libtiff.so.5.2.5
lrwxrwxrwx 1 root root     16 Mar 24 22:54 /opt/alfresco-community/common/lib/libtiff.so.5 -> libtiff.so.5.2.5
-rwxr-xr-x 1 root root 525016 Jun 16  2017 /opt/alfresco-community/common/lib/libtiff.so.5.2.5

It seems that Ubuntu's leptonica does not match the libtiff version bundled with Alfresco. Correct me if I'm wrong.
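One way to test that suspicion is to ask the dynamic linker which libtiff it resolves for tesseract in each environment. The logs above already hint at a difference: `tesseract -v` reports libtiff 4.0.6 from the shell but libtiff 4.0.7 when launched by Alfresco. The sketch below is only illustrative. It assumes that Alfresco's startup scripts put `/opt/alfresco-community/common/lib` on `LD_LIBRARY_PATH` (not confirmed in this thread), and `resolved_libtiff` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report which libtiff the dynamic linker would load for tesseract.

# Read `ldd` output on stdin and print the resolved path of libtiff.
resolved_libtiff() {
    awk '/libtiff\.so/ && $2 == "=>" { print $3 }'
}

# Assumption: this is the directory Alfresco adds to LD_LIBRARY_PATH.
ALF_LIB=/opt/alfresco-community/common/lib

# Only run the comparison when both tools are actually installed.
if command -v tesseract >/dev/null 2>&1 && command -v ldd >/dev/null 2>&1; then
    bin=$(command -v tesseract)
    echo "system environment:   $(ldd "$bin" | resolved_libtiff)"
    echo "Alfresco environment: $(LD_LIBRARY_PATH=$ALF_LIB ldd "$bin" | resolved_libtiff)"
fi
```

If the second line points into `/opt/alfresco-community/common/lib`, the bundled library is shadowing the system one that `/usr/lib/liblept.so.5` was built against.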

How can I fix this error?

Thank you, [bayu]

angelborroy-ks commented 6 years ago

You can try to switch to OCRmyPDF

billydekid commented 6 years ago

OK, I will try it and post the result here. Thanks.

billydekid commented 6 years ago

I still got an error when I clicked the OCR button:

Exception in thread "defaultAsyncAction5" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 03030024 Failed to perform OCR transformation:
Execution result:
   os:         Linux
   command:    /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
   succeeded:  false
   exit code:  1
   out:
   err:        Traceback (most recent call last):
  File "/usr/local/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 76, in <module>
    verify_python3_env(
        at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
        at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
        at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
        at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
        at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464)
        at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
        at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
        at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 03030024 Failed to perform OCR transformation:
Execution result:
   os:         Linux
   command:    /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
   succeeded:  false
   exit code:  1
   out:
   err:        Traceback (most recent call last):
  File "/usr/local/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 76, in <module>
    verify_python3_env(
        at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
        at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
        ... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 03030024 Failed to perform OCR transformation:
Execution result:
   os:         Linux
   command:    /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
   succeeded:  false
   exit code:  1
   out:
   err:        Traceback (most recent call last):
  File "/usr/local/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 76, in <module>
    verify_python3_env(
        at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
        ... 11 more

Running the same command directly from a shell produced the output below:

root@alflab:/opt/alfresco-community# /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
  DEBUG - ocrmypdf 6.1.3
  DEBUG - tesseract 3.04.01
  DEBUG - qpdf 6.0.0
  DEBUG - PyMuPDF 1.12.5
  DEBUG - libmupdf 1.12.0
WARNING - The 'hocr' PDF renderer is known to cause problems with one or more of the languages in your document. Use --pdf-renderer tesseract --output-type pdf to avoid this issue
You are using qpdf version 6.0.0 which has known issues including security vulnerabilities with certain malformed PDFs. Consider upgrading to version 7.0.0 or newer.
  DEBUG - os.symlink(/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf, /tmp/com.github.ocrmypdf.3rlj5_8_/origin)

________________________________________
Tasks which will be run:

Task enters queue = 'ocrmypdf.pipeline.triage'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/origin, /tmp/com.github.ocrmypdf.3rlj5_8_/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_and_parse_pdf'
  DEBUG - Beginning qpdf repair...
  DEBUG - Repair OK; beginning parse...
  DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_and_parse_pdf'
Task enters queue = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.split_page'
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.split_page'
Task enters queue = 'ocrmypdf.pipeline.ocr_or_skip'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.page.pdf, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.ocr_or_skip'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'

WARNING:
        In Task 'ocrmypdf.pipeline.skip_page':
        No jobs were run because no file names matched.
        Please make sure that the regular expression is correctly specified.

  DEBUG - Rasterize 000001.ocr.oriented.pdf with pngmono
  DEBUG -
  DEBUG - Ghostscript: resize output image (2703, 3498) -> (2705, 3501)
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.page.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-background.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-clean.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.png)
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.page.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
  DEBUG - ['tesseract', '-l', 'eng+ind', '/tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.png', '/tmp/com.github.ocrmypdf.3rlj5_8_/000001', 'hocr', 'txt']
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
  DEBUG -    1: convert
  DEBUG -    1: convert done
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.pipeline.render_hocr_page'
Completed Task = 'ocrmypdf.pipeline.render_hocr_page'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
  DEBUG - Final pages: /tmp/com.github.ocrmypdf.3rlj5_8_/000001.rendered.pdf
/tmp/com.github.ocrmypdf.3rlj5_8_/pdfa.ps
  DEBUG -
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
   INFO - Output file is a PDF/A-2B (as expected)
  DEBUG - <PdfInfo('...'), page count=1>

angelborroy-ks commented 6 years ago

Both look like problems with the installation of pdfsandwich or ocrmypdf on your operating system.

Install one or the other properly (you can test it from the raw command line) before trying the addon.

billydekid commented 6 years ago

Running the raw commands produced no errors. The PDF output is normal and searchable. Below are the two commands I used:

(1) /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8611528535680953898.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8611528535680953898_ocr.pdf

(2) In simple format: ocrmypdf -l eng+ind --pdf-renderer tesseract --output-type pdf ImageOnly.pdf hasil.pdf

Both of them produce an OCR'd document.

angelborroy-ks commented 6 years ago

If the programs work from the command line, then you have to configure Alfresco as described at https://github.com/keensoft/alfresco-simple-ocr/wiki/FAQ
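A command that works in an interactive shell can still fail when Alfresco launches it, because the service runs with a different environment. The truncated traceback earlier stops at `verify_python3_env(`, which may point to a missing UTF-8 locale, but that is an assumption and not something this thread confirms. The sketch below shows one way to approximate a bare service environment from the command line; `run_minimal_env` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: run a command with an almost empty environment, keeping only PATH,
# to approximate how a service might launch it.
run_minimal_env() {
    env -i PATH="$PATH" "$@"
}

# Show which locale variables a child process sees in each case.
run_minimal_env sh -c 'echo "LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset}"'
run_minimal_env LC_ALL=C.UTF-8 sh -c 'echo "LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset}"'

# If ocrmypdf is installed, compare its behaviour with and without a UTF-8 locale.
if command -v ocrmypdf >/dev/null 2>&1; then
    run_minimal_env ocrmypdf --version || echo "ocrmypdf failed in the bare environment"
    run_minimal_env LC_ALL=C.UTF-8 ocrmypdf --version || echo "ocrmypdf failed even with a UTF-8 locale"
fi
```

If the first ocrmypdf call fails and the second succeeds, the environment Alfresco passes to the OCR command is the likely culprit rather than the installation itself.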

ebrenessdev commented 4 years ago

Hello,

I had the same issue and solved it by changing the following properties:

    img.root=/usr
    img.dyn=${img.root}/lib
    img.exe=${img.root}/bin/convert
    img.gslib=/usr/share/ghostscript/9.26/lib

As you can see, I'm not using the "common" applications bundled with Alfresco. I'm using the paths of the ImageMagick and Ghostscript installations on my system.
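Before applying these values on another machine, it may help to confirm that the paths exist there, since the Ghostscript version directory (9.26 above) differs between systems. The following is a minimal sketch; `check_img_paths` is a hypothetical helper, and the properties are presumably set in `alfresco-global.properties`, which the comment above does not state.

```shell
#!/bin/sh
# Sketch: verify that the paths behind img.root / img.exe / img.dyn / img.gslib exist.
# Usage: check_img_paths <img.root> <img.gslib>
check_img_paths() {
    root=$1
    gslib=$2
    rc=0
    [ -x "$root/bin/convert" ] || { echo "missing executable: $root/bin/convert"; rc=1; }
    [ -d "$root/lib" ]         || { echo "missing directory: $root/lib"; rc=1; }
    [ -d "$gslib" ]            || { echo "missing directory: $gslib"; rc=1; }
    return $rc
}

# List the Ghostscript lib directories present on this machine, if any.
ls -d /usr/share/ghostscript/*/lib 2>/dev/null || true

# Check the values from the comment above; adjust the version to match the listing.
check_img_paths /usr /usr/share/ghostscript/9.26/lib || true
```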