Closed: billydekid closed this issue 6 years ago
You can try to switch to OCRmyPDF
OK, I will try it and post the result here. Thanks.
I still get an error when I click the OCR button:
Exception in thread "defaultAsyncAction5" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 03030024 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/local/bin/ocrmypdf", line 7, in <module>
from ocrmypdf.__main__ import run_pipeline
File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 76, in <module>
verify_python3_env(
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 03030024 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/local/bin/ocrmypdf", line 7, in <module>
from ocrmypdf.__main__ import run_pipeline
File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 76, in <module>
verify_python3_env(
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 03030024 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/local/bin/ocrmypdf", line 7, in <module>
from ocrmypdf.__main__ import run_pipeline
File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 76, in <module>
verify_python3_env(
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
... 11 more
Running the command:
/usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
produced the output below:
root@alflab:/opt/alfresco-community# /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047_ocr.pdf
DEBUG - ocrmypdf 6.1.3
DEBUG - tesseract 3.04.01
DEBUG - qpdf 6.0.0
DEBUG - PyMuPDF 1.12.5
DEBUG - libmupdf 1.12.0
WARNING - The 'hocr' PDF renderer is known to cause problems with one or more of the languages in your document. Use --pdf-renderer tesseract --output-type pdf to avoid this issue
You are using qpdf version 6.0.0 which has known issues including security vulnerabilities with certain malformed PDFs. Consider upgrading to version 7.0.0 or newer.
DEBUG - os.symlink(/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_1542116057064531047.pdf, /tmp/com.github.ocrmypdf.3rlj5_8_/origin)
________________________________________
Tasks which will be run:
Task enters queue = 'ocrmypdf.pipeline.triage'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/origin, /tmp/com.github.ocrmypdf.3rlj5_8_/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_and_parse_pdf'
DEBUG - Beginning qpdf repair...
DEBUG - Repair OK; beginning parse...
DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_and_parse_pdf'
Task enters queue = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.split_page'
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.split_page'
Task enters queue = 'ocrmypdf.pipeline.ocr_or_skip'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.page.pdf, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.ocr_or_skip'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'
WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.
DEBUG - Rasterize 000001.ocr.oriented.pdf with pngmono
DEBUG -
DEBUG - Ghostscript: resize output image (2703, 3498) -> (2705, 3501)
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.page.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-background.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.pp-clean.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.png)
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.3rlj5_8_/000001.page.png, /tmp/com.github.ocrmypdf.3rlj5_8_/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
DEBUG - ['tesseract', '-l', 'eng+ind', '/tmp/com.github.ocrmypdf.3rlj5_8_/000001.ocr.png', '/tmp/com.github.ocrmypdf.3rlj5_8_/000001', 'hocr', 'txt']
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
DEBUG - 1: convert
DEBUG - 1: convert done
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.pipeline.render_hocr_page'
Completed Task = 'ocrmypdf.pipeline.render_hocr_page'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
DEBUG - Final pages: /tmp/com.github.ocrmypdf.3rlj5_8_/000001.rendered.pdf
/tmp/com.github.ocrmypdf.3rlj5_8_/pdfa.ps
DEBUG -
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
INFO - Output file is a PDF/A-2B (as expected)
DEBUG - <PdfInfo('...'), page count=1>
Both look like problems with the installation of pdfsandwich or ocrmypdf on your operating system.
Install one or the other properly (you can test from the raw command line) before trying the addon.
After running the raw commands, I found no errors. The PDF output is normal and searchable. Below are the two commands I used:
(1) /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+ind /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8611528535680953898.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8611528535680953898_ocr.pdf
(2) In simple format: ocrmypdf -l eng+ind --pdf-renderer tesseract --output-type pdf ImageOnly.pdf hasil.pdf
Both of them produce an OCR'd document.
If the programs work from the command line, then you have to configure Alfresco as described in the FAQ at https://github.com/keensoft/alfresco-simple-ocr/wiki/FAQ
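The same command succeeding in a root shell but failing when Alfresco launches it points to a difference in the process environment. The traceback stops at verify_python3_env, so one plausible cause (an assumption, since the traceback is truncated) is that the Tomcat process starts Python without a UTF-8 locale. The sketch below is not part of the addon; it only shows how to run a probe command with and without locale variables so the two environments can be compared. The helper names `env_without_locale` and `run_with_env` are hypothetical.

```python
import os
import subprocess
import sys

LOCALE_VARS = ("LANG", "LANGUAGE", "LC_ALL", "LC_CTYPE")


def env_without_locale(lang=None):
    """Copy the current environment, drop locale variables, optionally set one.

    With lang=None this approximates a service environment that has no
    locale configured (an assumption about how Alfresco/Tomcat is started).
    """
    env = dict(os.environ)
    for name in LOCALE_VARS:
        env.pop(name, None)
    if lang is not None:
        env["LANG"] = lang
        env["LC_ALL"] = lang
    return env


def run_with_env(cmd, env):
    """Run cmd under env and return (exit_code, stdout, stderr)."""
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Probe: ask a fresh interpreter which text encoding it would use.
    probe = [sys.executable, "-c",
             "import locale; print(locale.getpreferredencoding())"]
    for lang in (None, "C.UTF-8"):
        code, out, err = run_with_env(probe, env_without_locale(lang))
        print("LANG=%r exit=%d encoding=%s" % (lang, code, out.strip()))
```

Replacing the probe with the real ocrmypdf command line from the error message would show whether the failure can be reproduced outside Alfresco simply by removing the locale variables. The reported encoding depends on the Python version and the locales installed on the host, so no expected output is given here.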
Hello,
I had the same issue and solved it by changing the following properties:
img.root=/usr
img.dyn=${img.root}/lib
img.exe=${img.root}/bin/convert
img.gslib=/usr/share/ghostscript/9.26/lib
As you can see, I'm not using the "common" applications bundled with Alfresco. I'm using the paths of the ImageMagick and Ghostscript installed on my system.
Hi, I have an issue after installing the latest simple-ocr 2.3.1 with the pdfsandwich OCR engine on Alfresco 5.2.0 and Ubuntu 16.04 LTS. All supporting applications were installed with apt-get / dpkg, as follows:
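Before restarting Alfresco with properties like these, it can help to confirm that every configured path actually exists on the host. The following is a minimal sketch, not an Alfresco tool: it parses simple key=value lines, expands `${...}` placeholders the way the properties above use them, and reports missing paths. The function names are hypothetical, and the sample values are the ones quoted in this comment, so they will differ on other systems (the Ghostscript version directory in particular).

```python
import os
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

# Values quoted from the comment above; adjust to the local installation.
SAMPLE = """\
img.root=/usr
img.dyn=${img.root}/lib
img.exe=${img.root}/bin/convert
img.gslib=/usr/share/ghostscript/9.26/lib
"""


def parse_properties(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def resolve(props, key, depth=10):
    """Expand ${name} placeholders in props[key]; unknown names are kept."""
    value = props[key]
    for _ in range(depth):
        expanded = PLACEHOLDER.sub(
            lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    for key in props:
        path = resolve(props, key)
        status = "exists" if os.path.exists(path) else "MISSING"
        print("%-10s %-45s %s" % (key, path, status))
```

A "MISSING" line for img.gslib usually just means the installed Ghostscript version differs from the one in the sample.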
TESSERACT:
PDFSANDWICH:
The error appears when I try to OCR a document by clicking the OCR button on the document page:
I found the following libtiff on the server:
It seems Ubuntu's leptonica does not match Alfresco's libtiff version. Correct me if I'm wrong.
How can I fix this error?
Thank you, [bayu]
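One way to check the suspected leptonica/libtiff mismatch is to look at which shared libraries the tesseract binary resolves. The sketch below is only an illustration under stated assumptions: it relies on the standard Linux `ldd` tool, assumes tesseract is on the PATH, and uses hypothetical helper names `parse_ldd` and `linked_libs`. Passing an `env` that mimics the Alfresco process (for example with its library path variables set) would show whether a different libtiff is picked up there.

```python
import shutil
import subprocess


def parse_ldd(output, needles=("libtiff", "liblept")):
    """Return {library name: resolved path} for ldd lines matching needles."""
    found = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue
        name, _, target = line.partition("=>")
        name = name.strip()
        if any(needle in name for needle in needles):
            # Drop the trailing load address, e.g. "(0x00007f...)".
            found[name] = target.split("(")[0].strip()
    return found


def linked_libs(binary, env=None):
    """Run ldd on a binary found on PATH; empty dict if anything is missing."""
    exe = shutil.which(binary)
    if exe is None or shutil.which("ldd") is None:
        return {}
    proc = subprocess.run(
        ["ldd", exe],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_ldd(proc.stdout)


if __name__ == "__main__":
    for name, path in sorted(linked_libs("tesseract").items()):
        print(name, "->", path)
```

If the resolved libtiff path points into the Alfresco installation rather than the system library directory, that would support the mismatch theory; if not, the cause is more likely elsewhere.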