keensoft / alfresco-simple-ocr

Simple OCR action for Alfresco

File was not modified #51

Open Goku103 opened 6 years ago

Goku103 commented 6 years ago

Hello,

I've been encountering an issue recently. When I run the OCR action on a PDF file, a message appears saying "the document will be available in minutes", but I can't find any converted file, and the original file is not modified.

I would like to know if anyone has seen this issue before or can help me get the PDF converted.

Thank you in advance.

angelborroy-ks commented 6 years ago

Please include a detailed stacktrace from alfresco.log or catalina.out.

Thanks.

Goku103 commented 6 years ago

In alfresco.log, I don't see any errors.

In catalina.out:

Exception in thread "defaultAsyncAction1" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 06270022 Failed to perform OCR transformation:
Execution result:
   os: Linux
   command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l spa+eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998_ocr.pdf
   succeeded: false
   exit code: 1
   out:
   err: Traceback (most recent call last):
     File "/usr/local/bin/ocrmypdf", line 7, in <module>
       from ocrmypdf.__main__ import run_pipeline
     File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 70, in <module>
       verify_python3_env(
    at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
    at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
    at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
    at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
    at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:464)
    at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
    at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
    at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 06270022 Failed to perform OCR transformation:
Execution result:
   os: Linux
   command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l spa+eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998_ocr.pdf
   succeeded: false
   exit code: 1
   out:
   err: Traceback (most recent call last):
     File "/usr/local/bin/ocrmypdf", line 7, in <module>
       from ocrmypdf.__main__ import run_pipeline
     File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 70, in <module>
       verify_python3_env(
    at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
    at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
    ... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 06270022 Failed to perform OCR transformation:
Execution result:
   os: Linux
   command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l spa+eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998_ocr.pdf
   succeeded: false
   exit code: 1
   out:
   err: Traceback (most recent call last):
     File "/usr/local/bin/ocrmypdf", line 7, in <module>
       from ocrmypdf.__main__ import run_pipeline
     File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 70, in <module>
       verify_python3_env(
    at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
    ... 11 more

angelborroy-ks commented 6 years ago

You can probably run the failing transformation from the command line to get more details:

/usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l spa+eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998_ocr.pdf

Goku103 commented 6 years ago

OK. First, I get:

/usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l spa+eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998_ocr.pdf
DEBUG - ocrmypdf 7.0.0
DEBUG - tesseract 4.0.0-beta.3-249-g607e
DEBUG - qpdf 8.0.2
ERROR - The installed version of tesseract does not have language data for the following requested languages:
spa

I then ran the command without "spa". Now I get:

`/usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998_ocr.pdf DEBUG - ocrmypdf 7.0.0 DEBUG - tesseract 4.0.0-beta.3-249-g607e DEBUG - qpdf 8.0.2 WARNING - The installed version of Ghostscript does not work correctly with the OCR languages you specified. Use --output-type pdf or upgrade to Ghostscript 9.20 or later to avoid this issue.Found Ghostscript 9.18 DEBUG - os.symlink(/opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998.pdf, /tmp/com.github.ocrmypdf.jqqkrft4/origin)


Tasks which will be run:

Task enters queue = 'ocrmypdf._pipeline.triage' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/origin, /tmp/com.github.ocrmypdf.jqqkrft4/origin.pdf) Completed Task = 'ocrmypdf._pipeline.triage' Task enters queue = 'ocrmypdf._pipeline.repair_and_parse_pdf' DEBUG - <PdfInfo('...'), page count=1> Completed Task = 'ocrmypdf._pipeline.repair_and_parse_pdf' Task enters queue = 'ocrmypdf._pipeline.marker_pages' Task enters queue = 'ocrmypdf._pipeline.generate_postscript_stub' Completed Task = 'ocrmypdf._pipeline.marker_pages' Task enters queue = 'ocrmypdf._pipeline.ocr_or_skip' Completed Task = 'ocrmypdf._pipeline.generate_postscript_stub' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/000001.marker.pdf, /tmp/com.github.ocrmypdf.jqqkrft4/000001.ocr.page.pdf) Completed Task = 'ocrmypdf._pipeline.ocr_or_skip' Task enters queue = 'ocrmypdf._pipeline.orient_page' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.jqqkrft4/000001.ocr.oriented.pdf) Completed Task = 'ocrmypdf._pipeline.orient_page' Task enters queue = 'ocrmypdf._pipeline.rasterize_with_ghostscript' DEBUG - Rasterize 000001.ocr.oriented.pdf with pngmono DEBUG - ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pngmono', '-dFirstPage=1', '-dLastPage=1', '-r600x600', '-o', '/tmp/tmpanbb5z_n', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.jqqkrft4/000001.ocr.oriented.pdf'] DEBUG - DEBUG - Rotating output by 0 Completed Task = 'ocrmypdf._pipeline.rasterize_with_ghostscript' Task enters queue = 'ocrmypdf._pipeline.preprocess_remove_background' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/000001.page.png, /tmp/com.github.ocrmypdf.jqqkrft4/000001.pp-background.png) Completed Task = 'ocrmypdf._pipeline.preprocess_remove_background' Task enters queue = 'ocrmypdf._pipeline.preprocess_deskew' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/000001.pp-background.png, /tmp/com.github.ocrmypdf.jqqkrft4/000001.pp-deskew.png) 
Completed Task = 'ocrmypdf._pipeline.preprocess_deskew' Task enters queue = 'ocrmypdf._pipeline.preprocess_clean' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.jqqkrft4/000001.pp-clean.png) Completed Task = 'ocrmypdf._pipeline.preprocess_clean' Task enters queue = 'ocrmypdf._pipeline.select_ocr_image' Task enters queue = 'ocrmypdf._pipeline.select_visible_page_image' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/000001.pp-clean.png, /tmp/com.github.ocrmypdf.jqqkrft4/000001.ocr.png) Completed Task = 'ocrmypdf._pipeline.select_ocr_image' Task enters queue = 'ocrmypdf._pipeline.ocr_tesseract_textonly_pdf' DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/000001.page.png, /tmp/com.github.ocrmypdf.jqqkrft4/000001.image) Completed Task = 'ocrmypdf._pipeline.select_visible_page_image' Task enters queue = 'ocrmypdf._pipeline.select_image_layer' DEBUG - 1: convert DEBUG - 1: convert done DEBUG - ['tesseract', '-l', 'eng+fra', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.jqqkrft4/000001.ocr.png', '/tmp/com.github.ocrmypdf.jqqkrft4/000001.text', 'pdf', 'txt'] Completed Task = 'ocrmypdf._pipeline.select_image_layer' Completed Task = 'ocrmypdf._pipeline.ocr_tesseract_textonly_pdf' Task enters queue = 'ocrmypdf._weave.weave_layers' DEBUG - 1 DEBUG - ['/tmp/com.github.ocrmypdf.jqqkrft4/000001.image-layer.pdf', '/tmp/com.github.ocrmypdf.jqqkrft4/000001.text.pdf', '/tmp/com.github.ocrmypdf.jqqkrft4/000001.text.txt'] DEBUG - Replace DEBUG - [0, 0, 0, 0] DEBUG - Grafting DEBUG - (0.9999522656985343, 0.9999523775592609) Completed Task = 'ocrmypdf._weave.weave_layers' Task enters queue = 'ocrmypdf._pipeline.metadata_fixup' DEBUG - ['gs', '-dQUIET', '-dBATCH', '-dNOPAUSE', '-dCompatibilityLevel=1.6', '-dNumRenderingThreads=2', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=/RGB', '-sProcessColorModel=DeviceRGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', 
'-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-sOutputFile=/tmp/tmpxkb793zf', '/tmp/com.github.ocrmypdf.jqqkrft4/layers.rendered.pdf', '/tmp/com.github.ocrmypdf.jqqkrft4/pdfa.ps'] DEBUG - Completed Task = 'ocrmypdf._pipeline.metadata_fixup' Task enters queue = 'ocrmypdf._pipeline.optimize_pdf' DEBUG - Optimizable images: JBIG2 groups: 0 JPEGs: 0 PNGs: 1 Errors: 0 INFO - Optimize ratio: 1.00 savings: -0.0% INFO - Optimize did not improve the file - discarded DEBUG - os.symlink(/tmp/com.github.ocrmypdf.jqqkrft4/metafix.pdf, /tmp/com.github.ocrmypdf.jqqkrft4/metafix.optimized.pdf) Completed Task = 'ocrmypdf._pipeline.optimize_pdf' Task enters queue = 'ocrmypdf._pipeline.copy_final' DEBUG - /tmp/com.github.ocrmypdf.jqqkrft4/metafix.optimized.pdf -> /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_8049813577400375998_ocr.pdf Completed Task = 'ocrmypdf._pipeline.copy_final' INFO - Output file is a PDF/A-2B (as expected) DEBUG - <PdfInfo('...'), page count=1>`
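The first run above failed because Tesseract had no language data for `spa`. A minimal sketch for checking a whole `-l` value against the installed languages before changing the configuration. `missing_langs` is a hypothetical helper written for this thread, not part of ocrmypdf or alfresco-simple-ocr.

```shell
# missing_langs REQUESTED INSTALLED
#   REQUESTED: languages in ocrmypdf's -l syntax, e.g. "spa+eng+fra"
#   INSTALLED: whitespace- or newline-separated language codes
# Prints every requested language that is not in INSTALLED, one per line.
# The body runs in a subshell, so no variables leak into the caller.
missing_langs() (
  # Normalize all whitespace (including newlines) to single spaces and
  # pad both ends so that " eng " style matching works for every entry.
  installed=" $(printf '%s' "$2" | tr -s '[:space:]' ' ') "

  # Split "spa+eng+fra" into one language per line and test each one.
  printf '%s\n' "$1" | tr '+' '\n' | while IFS= read -r lang; do
    [ -n "$lang" ] || continue
    case "$installed" in
      *" $lang "*) ;;                 # language data is present
      *) printf '%s\n' "$lang" ;;     # language data is missing
    esac
  done
)
```

One possible use is `missing_langs "spa+eng+fra" "$(tesseract --list-langs 2>&1 | tail -n +2)"`, which assumes the first line printed by `tesseract --list-langs` is a header. On Debian or Ubuntu, which the `dist-packages` path in the traceback suggests but does not confirm, Spanish data is normally provided by the `tesseract-ocr-spa` package.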

Goku103 commented 6 years ago

Now I can find the OCR'd file if I download it directly from my server over SSH, but I have to run the command manually to transform the file. From the web interface, the file is not transformed.

Sorry for my English.

angelborroy-ks commented 6 years ago

So you have to include/exclude the missing options ("spa" and so on) in your alfresco-global.properties and it's done.

Goku103 commented 6 years ago

Yes, but when I click the OCR button in the Alfresco web interface, nothing happens. In catalina.out, I have:

Execution result:
   os: Linux
   command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3205329177977384766.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3205329177977384766_ocr.pdf
   succeeded: false
   exit code: 1
   out:
   err: Traceback (most recent call last):
     File "/usr/local/bin/ocrmypdf", line 7, in <module>
       from ocrmypdf.__main__ import run_pipeline
     File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 70, in <module>
       verify_python3_env(
    at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
    at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
    ... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 06270065 Failed to perform OCR transformation:
Execution result:
   os: Linux
   command: /usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3205329177977384766.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3205329177977384766_ocr.pdf
   succeeded: false
   exit code: 1
   out:
   err: Traceback (most recent call last):
     File "/usr/local/bin/ocrmypdf", line 7, in <module>
       from ocrmypdf.__main__ import run_pipeline
     File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/__main__.py", line 70, in <module>
       verify_python3_env(
    at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
    ... 11 more

If I run the command manually:

/usr/local/bin/ocrmypdf --verbose 1 --force-ocr -l eng+fra /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3205329177977384766.pdf /opt/alfresco-community/tomcat/temp/Alfresco/OCRTransformWorker_source_3205329177977384766_ocr.pdf

It works, but I get nothing in Alfresco.

angelborroy-ks commented 6 years ago

Then it is probably an environment problem.

Check the FAQ section.
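The same command succeeds from an SSH session but stops at `verify_python3_env(` when Tomcat launches it, which fits the two processes seeing different environment variables. The traceback is cut off before the error message, so what follows is an assumption: if that check is about the locale encoding, a name-based test like the sketch below can be run in both contexts. `effective_ctype` and `is_utf8_locale` are hypothetical helpers, and they only inspect variable names, not whether the locale is actually generated on the system.

```shell
# effective_ctype: print the locale name that governs character encoding,
# following the POSIX precedence LC_ALL > LC_CTYPE > LANG.
# Falls back to "C" when none of the three is set or all are empty.
effective_ctype() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "${LC_CTYPE:-}" ]; then
    printf '%s\n' "$LC_CTYPE"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    printf 'C\n'
  fi
}

# is_utf8_locale: succeed (exit status 0) when the effective locale name
# mentions UTF-8 in either common spelling, fail otherwise.
is_utf8_locale() {
  case "$(effective_ctype)" in
    *[Uu][Tt][Ff]-8* | *[Uu][Tt][Ff]8*) return 0 ;;
    *) return 1 ;;
  esac
}
```

To approximate a stripped-down service environment from a shell, one option is to rerun the failing command as `env -i PATH="$PATH" /usr/local/bin/ocrmypdf ...` with the same arguments as in the log; `env -i` starts the command with an otherwise empty environment. If that reproduces the traceback, setting `LANG` or `LC_ALL` to a UTF-8 locale in the environment Tomcat starts with would be the next thing to try.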

Goku103 commented 6 years ago

My environment? Normally, when you click the OCR button, does the file change automatically?

gtnieto commented 4 years ago

I have the same problem as Goku103, but I'm working with pdfsandwich. When I run the conversion from the command line, I get the extra converted file. But when I request the conversion in Alfresco, the same message is shown and no new file is created. In fact, nothing happens.

I should mention that my Alfresco installation runs on Ubuntu. Maybe the syntax in the properties file should be different for this setting? ocr.server.os=linux