keensoft / alfresco-simple-ocr

Simple OCR action for Alfresco

Problem writing OCRed file #5

Closed mquarto closed 8 years ago

mquarto commented 8 years ago

Hello, this is a very nice tool to have, but I cannot seem to make it work. It produces the OCR file (through pdfsandwich in my case) but cannot write the final result and place it in the Alfresco repository. Below is an extract from the alfresco.log file; maybe you can help? Thanks in advance, Mat

2016-04-07 17:08:56,234 WARN [es.keensoft.alfresco.ocr.OCRExtractAction] [defaultAsyncAction3] org.alfresco.service.cmr.repository.ContentIOException: 03070084 Failed to copy reader content to writer:
   writer: ContentAccessor[ contentUrl=store://2016/4/7/17/8/da25d86b-dbc1-4b95-ba10-4d2daaef7b17.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=it]
   source reader: ContentAccessor[ contentUrl=store://2016/4/7/17/6/4a37ad34-b135-49cb-b316-9c7386eaf5bd.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=en_US]
org.alfresco.service.cmr.repository.ContentIOException: 03070084 Failed to copy reader content to writer:
   writer: ContentAccessor[ contentUrl=store://2016/4/7/17/8/da25d86b-dbc1-4b95-ba10-4d2daaef7b17.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=it]
   source reader: ContentAccessor[ contentUrl=store://2016/4/7/17/6/4a37ad34-b135-49cb-b316-9c7386eaf5bd.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=en_US]

angelborroy-ks commented 8 years ago

Thanks for your interest, Mat.

Can you provide more log trace?

It looks like your pdfsandwich is not working when launched from inside Alfresco, probably because you used the wizard to install Alfresco and pdfsandwich is picking up a lower version of gs than required.

mquarto commented 8 years ago

Thanks Angel for the quick reply! Attached is a bigger extract from the same log file.

It's true what you say: I used the wizard to install Alfresco, and pdfsandwich works correctly when launched separately. How can I change the gs version? Sorry to ask...

alfrescolog.ocr.txt

angelborroy-ks commented 8 years ago

When pdfsandwich is working properly, the following lines appear in the log:

Checking for gs:
gs -v
GPL Ghostscript 9.10 (2013-08-30)

Your installation is picking Ghostscript 8.64 and pdfsandwich fails.

You can check the Alfresco documentation on installing additional software; there is a specific section for Ghostscript.
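To see which gs a given search path would actually pick, a small lookup helper can be useful. The sketch below is only an illustration: the function name first_in_path is made up for this example, and the PATH that the Alfresco service really uses has to be taken from its own start-up environment, which may differ from an interactive shell.

```shell
#!/bin/sh
# first_in_path NAME PATHLIST
# Print the first executable file called NAME found in the colon-separated
# PATHLIST, walking the entries in order as the shell does when it resolves
# a command name. (Hypothetical helper, not part of Alfresco or pdfsandwich.)
first_in_path() (
    name=$1
    IFS=:
    set -f                      # no globbing while PATHLIST is being split
    for dir in $2; do
        candidate="${dir:-.}/$name"   # an empty entry means the current directory
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            exit 0
        fi
    done
    exit 1                      # nothing found
)
```

Calling first_in_path gs "$PATH" prints the gs binary that would be used for that PATH; running the printed binary with -v then shows its version.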

mquarto commented 8 years ago

On my Linux machine I have /usr/share/ghostscript/9.10/lib installed (ghostscript -version, or gs -v, returns GPL Ghostscript 9.10 (2013-08-30)), so I edited alfresco-global.properties, inserting the line img.gslib = /usr/share/ghostscript/9.10/lib, and restarted the service.

The resulting log file from uploading a PDF file is attached. Still no success: Alfresco seems to load the old gs version, yet it does not give the same error as before, and I cannot see an OCRed PDF. Is there anything else I'm missing?

Thanks for your help.

alfresco.log.txt

angelborroy-ks commented 8 years ago

gs 8.64 is still selected by Alfresco when launching the pdfsandwich program.

How about creating a shell script that sets PATH before invoking pdfsandwich? You can put your /usr/share/ghostscript/9.10/lib before any other entry and it should work. Remember to change the parameter in alfresco-global.properties to point to this new script instead of pdfsandwich.

angelborroy-ks commented 8 years ago

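My steps to solve your problem:

Create a script ocr.sh with this content:

export PATH=/usr/bin:$PATH
pdfsandwich $@

Make it executable:

chmod +x ocr.sh

In alfresco-global.properties, comment out the default ImageMagick lines and add the values below:

# img.root=/home/alfresco/alfresco-community/common
# img.dyn=${img.root}/lib
# img.exe=${img.root}/bin/convert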

img.root=/usr/share/doc/imagemagick
img.exe=/usr/bin/convert
img.config=${img.root}
img.coders=/usr/lib/x86_64-linux-gnu/ImageMagick-6.7.7/modules-Q16/coders
img.dyn=/usr/local/lib
img.gslib=/usr/local/lib
swf.exe=/usr/bin/pdf2swf

# OCR
ocr.command=/home/alfresco/alfresco-community/scripts/ocr.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang spa+eng+fra
ocr.server.os=linux

After these steps the add-on started to work. Maybe it's not so clean, but it can be used in the meantime.
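A slightly more defensive variant of the wrapper is sketched below. It is not the script used above: OCR_BIN and GS_DIR are hypothetical override variables introduced only for this example, and the main difference is that "$@" is quoted so that file names containing spaces reach the OCR tool as single arguments.

```shell
#!/bin/sh
# Sketch of a defensive ocr.sh; the real wrapper in this thread is simpler.
# OCR_BIN and GS_DIR are assumptions of this example, not add-on settings.
OCR_BIN=${OCR_BIN:-pdfsandwich}   # OCR program to launch
GS_DIR=${GS_DIR:-/usr/bin}        # directory holding the wanted gs

run_ocr() {
    # Put the wanted Ghostscript directory first so it wins the PATH lookup.
    PATH="$GS_DIR:$PATH"
    export PATH
    # Quoted "$@" forwards every argument unchanged, spaces included.
    "$OCR_BIN" "$@"
}

# Run only when the caller passed arguments (options, source, destination).
if [ "$#" -gt 0 ]; then
    run_ocr "$@"
fi
```

As before, ocr.command in alfresco-global.properties would point at this script instead of at pdfsandwich.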

mquarto commented 8 years ago

Same steps here (I think), but a different result (no success):

export PATH=/usr/share/ghostscript/9.10/lib:$PATH
pdfsandwich $@

img.gslib = /usr/share/ghostscript/9.10/lib
# OCR
ocr.command=/opt/alfresco/scripts/ocr.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang ita
ocr.server.os=linux

The PDF is still not OCRed, and the attached log file shows gs 8.64 is still loaded instead of 9.10.

alfresco.log.txt

What am I doing wrong?

angelborroy-ks commented 8 years ago

You must copy-paste my values: img.gslib should be /usr/local/lib

mquarto commented 8 years ago

OK, done; problem not solved. Extract from the log file attached.

alfresco.log.txt

angelborroy-ks commented 8 years ago

You are missing some step, or you are using incorrect values in one of your properties files: if you copy-paste all the configuration above, it works.
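When comparing two installations, it can help to print the value that a properties file actually assigns to each key. The helper below is a simplified sketch (get_prop is a name invented here): it handles plain key=value lines with optional spaces around the equals sign and skips comment lines, but it ignores other features of the Java properties format such as ":" separators and backslash line continuations.

```shell
#!/bin/sh
# get_prop KEY FILE
# Print the value assigned to KEY in a properties FILE. If the key occurs
# more than once, the last assignment is printed. Simplified parser: only
# "key=value" lines, optional blanks around "=", comments starting with # or !.
get_prop() {
    awk -v key="$1" '
        /^[ \t]*[#!]/ { next }              # skip comment lines
        {
            line = $0
            sub(/^[ \t]+/, "", line)        # drop leading blanks
            idx = index(line, "=")
            if (idx == 0) next              # not an assignment
            k = substr(line, 1, idx - 1)
            v = substr(line, idx + 1)
            sub(/[ \t]+$/, "", k)           # blanks before "="
            sub(/^[ \t]+/, "", v)           # blanks after "="
            if (k == key) { value = v; found = 1 }
        }
        END { if (found) print value }
    ' "$2"
}
```

For instance, get_prop img.gslib followed by the path of alfresco-global.properties shows the Ghostscript library directory that file sets, and get_prop ocr.command shows which script or binary the add-on will launch.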

mquarto commented 8 years ago

OK, I'll start from scratch and try again. Can you confirm the script has to be

export PATH=/usr/bin:$PATH
pdfsandwich $@

and not

export PATH=/usr/share/ghostscript/9.10/lib:$PATH
pdfsandwich $@

angelborroy-ks commented 8 years ago

Yes, I can confirm that.

jk-ots commented 8 years ago

If you are running into these troubles regarding dependencies between any "external" software being called and Alfresco, and you are not able or willing to install from scratch, you can isolate the call from Alfresco to the external package by using a script like the one mentioned above, BUT spawn a new user session from within the script, one that has the correct initialization for the external software. That way you can check and set up the external package independently from Alfresco and then just call it via the script, as you would from a normal command line. You simply do that by building a script like this:

#!/bin/bash
/bin/su -l -c "/comand2execute -options $@"

This is for the case where you use root as the user. Otherwise you have to specify the user you want to spawn the session for (after /bin/su -l), and if you are not running Alfresco as root you have to ensure that the credentials are specified, encrypted, or not necessary, as you like. Don't forget $@ at the end, because it forwards all the parameters coming from the caller (i.e. further options, source and destination file).

I would recommend to specify all your options in this script and NOT in the alfresco-global.properties of this add-on, because
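One caveat with the su approach: inside a double-quoted string, "$@" still expands to one word per argument, so su may receive the second and later arguments as separate words instead of as part of the command string. The sketch below shows one way to build a single, safely quoted command string first. All function names are hypothetical, and the target user is an assumption to be replaced with whatever account has the working environment.

```shell
#!/bin/sh
# quote_arg ARG: wrap ARG in single quotes (escaping embedded single quotes)
# so that a shell re-reading the result sees exactly one word.
quote_arg() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# build_command CMD [ARG...]: join all words into one quoted command string.
build_command() {
    out=
    for word in "$@"; do
        out="$out${out:+ }$(quote_arg "$word")"
    done
    printf '%s\n' "$out"
}

# run_as_login_user USER CMD [ARG...]: run CMD in a fresh login session of
# USER. The caller must be root, or otherwise allowed to su without a prompt.
run_as_login_user() {
    user=$1
    shift
    su -l "$user" -c "$(build_command "$@")"
}
```

A wrapper script would then end with a line such as run_as_login_user alfresco /usr/bin/pdfsandwich "$@", where the user name alfresco and the binary path are placeholders.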

mquarto commented 8 years ago

I'm probably one little step ahead now: I can see an OCRed PDF in the alf_data/contentstore subdir, and I can see from the log file that Ghostscript 9.10 is loaded, but the OCRed file is not loaded into the Alfresco repository in place of the original one.

(By the way, I have rebuilt everything from scratch and followed your instructions.)

alfresco-global.properties.txt

"Failed to copy reader content to writer:" seems to be the issue.

Thanks for your patience.

alfresco.log.txt

angelborroy-ks commented 8 years ago

It is really hard to identify your problem.

I'd suggest trying the instructions from jk-ots described in this thread, or installing "ocrmypdf" instead of "pdfsandwich".

mquarto commented 8 years ago

I understand, and I don't mean to insist, but what I now read in the log and find in the content folder shows that it's not a conversion or OCR issue (that happens nicely). It seems to be about writing the output OCRed file to the Alfresco repository; this raises an exception from es.keensoft.alfresco.ocr.OCRExtractAction.

This is the log portion I'm talking about. The files listed in the exception log are in place and correct (original and OCRed).

2016-04-08 15:34:16,206 WARN  [es.keensoft.alfresco.ocr.OCRExtractAction] [defaultAsyncAction4] org.alfresco.service.cmr.repository.ContentIOException: 03080008 Failed to copy reader content to writer: 
   writer: ContentAccessor[ contentUrl=store://2016/4/8/15/34/79f11cf3-abd5-4529-a2d0-51c9bda80a47.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it]
   source reader: ContentAccessor[ contentUrl=store://2016/4/8/15/33/2235d95c-8fb5-4a7d-a630-8ac67f39c286.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it_IT]
org.alfresco.service.cmr.repository.ContentIOException: 03080008 Failed to copy reader content to writer: 
   writer: ContentAccessor[ contentUrl=store://2016/4/8/15/34/79f11cf3-abd5-4529-a2d0-51c9bda80a47.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it]
   source reader: ContentAccessor[ contentUrl=store://2016/4/8/15/33/2235d95c-8fb5-4a7d-a630-8ac67f39c286.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it_IT]

mquarto commented 8 years ago

Nope, sorry, I was wrong: the original and OCRed files are there, but there is a file (re)naming issue.

The Extract OCR rule is trying to read the file store://2016/4/8/15/33/2235d95c-8fb5-4a7d-a630-8ac67f39c286.bin, which does not exist; the real source filename is instead store://2016/4/8/15/33/4e914027-5c78-47e6-8c35-8656c11c62a2.bin

I reckon there is a step within the rule execution where the filename is (or should be) changed, but something goes wrong?

Thanks again
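The check described above (do the files named in the exception exist in the content store?) can be scripted. This is a sketch under two assumptions: that store:// URLs map directly onto paths below the content store directory (the alf_data/contentstore folder mentioned earlier), and that grep supports the widely available -o option. The function names are invented for this example.

```shell
#!/bin/sh
# extract_store_urls: read log text on stdin, print each distinct store:// URL
# that appears as a contentUrl value.
extract_store_urls() {
    grep -o 'contentUrl=store://[^,]*' | sed 's/^contentUrl=//' | sort -u
}

# store_url_to_path URL ROOT: map store://a/b/c.bin to ROOT/a/b/c.bin.
store_url_to_path() {
    case $1 in
        store://*) printf '%s/%s\n' "$2" "${1#store://}" ;;
        *) return 1 ;;
    esac
}

# report_store_files LOGFILE ROOT: say for every URL in LOGFILE whether the
# corresponding file exists below ROOT.
report_store_files() {
    extract_store_urls < "$1" | while IFS= read -r url; do
        path=$(store_url_to_path "$url" "$2") || continue
        if [ -f "$path" ]; then
            echo "present $path"
        else
            echo "MISSING $path"
        fi
    done
}
```

Running report_store_files with the log file and the content store directory as arguments lists one line per referenced file, which makes a reader URL that points at a non-existent .bin file stand out.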

andreklug commented 8 years ago

Hi, it looks like I am running into exactly the same problem.

files.zip

I have tried OCRmyPDF as well, but it fails with another error (some lib is missing version info from python3); let me know if you need details for this as well.

Thank you!

angelborroy-ks commented 8 years ago

Caused by: org.springframework.dao.ConcurrencyFailureException: Failed to update node 34403

How are you configuring the rule? Is your PDF too short?

It will work if you mark the "Run rule in background" option.

andreklug commented 8 years ago

It is marked to run in the background. Source is a normal A4 1-page PDF.

angelborroy-ks commented 8 years ago

And does it work if you unset "background"?

andreklug commented 8 years ago

Yes, it does, although adding multiple files gets quite inconvenient this way. Do you have an idea why this depends on running in the background or not?

angelborroy-ks commented 8 years ago

It looks like your system is too slow, and OCR is trying to update the content during the rendition phase for your new content.

We'll try to provide a new release to cover this case.

andreklug commented 8 years ago

This might be correct. The system is a VM with 2 cores / 4GB.

angelborroy-ks commented 8 years ago

Thanks to mquarto and andreklug for your collaboration. Finally, we have found the reason for that weird behaviour. Please upgrade your AMP to version 1.0.2 and let me know if it works for you.

andreklug commented 8 years ago

Thank you, angelborroy-ks, for updating the repo. Unfortunately, still no success. I have verified that the 1.0.2 repo is installed and rebooted the server (just in case). The log is attached.

alfresco.log.zip

angelborroy-ks commented 8 years ago

It seems that your system is not properly updated.

Are you using this AMP? https://github.com/keensoft/alfresco-simple-ocr/releases/download/1.0.2/simple-ocr-repo.amp

Can you try with a clean Alfresco installation?

Thanks

massimone73 commented 8 years ago

Hello everyone. I also had trouble configuring simple-ocr, but after a whole day of work and more than a hundred Alfresco reboots, I managed to get this extraordinary add-on working.

For the record, my Alfresco Community version is 5.1.e, running on Debian 8 64-bit with 4 vCPUs and 4 GB of RAM.

The installation procedure follows:

apt-get update

apt-get upgrade

apt-get install zlib1g-dev libjpeg-dev libffi-dev ghostscript tesseract-ocr tesseract-ocr-ita qpdf unpaper python3-pip python3-pil python3-pytest python3-reportlab ocaml imagemagick exactimage

pip3 install ocrmypdf

cd /opt

wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.1.4/pdfsandwich_0.1.4_amd64.deb

dpkg -i pdfsandwich_0.1.4_amd64.deb

apt-get -fy install

/opt/alfresco/alfresco.sh stop

cd /opt/alfresco/amps

wget https://github.com/keensoft/alfresco-simple-ocr/releases/download/1.0.2/simple-ocr-repo.amp

apt-get install default-jre

cd /opt/alfresco/bin

java -jar alfresco-mmt.jar install /opt/alfresco/amps/simple-ocr-repo.amp /opt/alfresco/tomcat/webapps/alfresco.war -verbose

/opt/alfresco/bin/apply_amps.sh

nano /opt/alfresco/scripts/ocr.sh

export PATH=/usr/bin:$PATH
pdfsandwich $@

chmod +x /opt/alfresco/scripts/ocr.sh

nano /opt/alfresco/tomcat/shared/classes/alfresco-global.properties

Comment out the following lines

img.root=/opt/alfresco/common

img.dyn=${img.root}/lib

img.exe=${img.root}/bin/convert

Add the following lines

img.root=/usr/share/doc/imagemagick
img.exe=/usr/bin/convert
img.config=${img.root}
img.coders=/usr/lib/x86_64-linux-gnu/ImageMagick-6.8.9/modules-Q16/coders
img.dyn=/usr/share/ghostscript/9.06/lib
img.gslib=/usr/share/ghostscript/9.06/lib

# OCR WITH PDFSANDWICH
ocr.command=/opt/alfresco/scripts/ocr.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang ita+eng
ocr.server.os=linux

/opt/alfresco/alfresco.sh start

Create a folder in the root of the "My Files" on which to send the files to be processed by OCR, for example: Scanner

Create a rule on the Scanner folder.
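Before restarting Alfresco, it may be worth confirming that the tools installed by the procedure above are actually reachable. The helper below is a small sketch (missing_commands is a made-up name); the list of binaries in the usage note is an assumption based on the packages installed above.

```shell
#!/bin/sh
# missing_commands NAME...
# Print every NAME that cannot be found on the current PATH, one per line.
# Prints nothing when all of them are available.
missing_commands() {
    for name in "$@"; do
        command -v "$name" >/dev/null 2>&1 || printf '%s\n' "$name"
    done
}
```

For this setup the call could be missing_commands gs tesseract pdfsandwich convert unpaper qpdf; an empty result means every listed tool was found for the user running the check, which is not necessarily the user or PATH that Alfresco itself runs with.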

angelborroy-ks commented 8 years ago

Thanks for your contribution massimone73.

Can you confirm if both synchronous and asynchronous rules are working with 1.0.2 release?

Also, would it be useful to package your script as a Dockerfile for others?

Thanks

massimone73 commented 8 years ago

I'm sorry, I do not know how to check. Could you please show me how to verify what you are referring to?

angelborroy-ks commented 8 years ago

When you define a new rule, you can check or uncheck the option "Run in background".

When checked, the action is executed in the background, after you upload or change the document. When unchecked, OCR is launched synchronously, as part of the upload itself.

This issue started out discussing problems with the Alfresco installation, but the latest comments resolved a different problem: the add-on didn't work in the background under some circumstances.

Thanks again.

massimone73 commented 8 years ago

I confirm. Both methods work perfectly. After enabling the 'Run in background' option, I uploaded a .tif file and the upload was a breeze. After about 30 seconds I refreshed the page and the same document also appeared as a PDF. Thank you.

massimone73 commented 8 years ago

Could you explain how to index all, or nearly all, the words of a document processed with simple-ocr? After loading a .tif file, simple-ocr generates another file in PDF. If I open the PDF file just produced by OCR, I can highlight and copy almost all the words of the document correctly, but if I use these words to search for the document with Alfresco's file search, only some of them return the PDF file.

angelborroy-ks commented 8 years ago

It should work by default. Try uploading the OCRed PDF file to another folder: does SOLR find the contents of this new file? If not, maybe some indexing problem has to be solved.

massimone73 commented 8 years ago

Solved. The problem was due to the quality of the document and the scanner. Today I tried with the multifunction installed in my company and indexing is perfect. Thanks.

mquarto commented 8 years ago

Thank you, angelborroy-ks, for the 1.0.2 update, and massimone73 for the detailed instructions. It now works perfectly, and it's a very useful add-on!

angelborroy-ks commented 8 years ago

Thanks to everyone.

Should we close this issue, or is something still unresolved?

mquarto commented 8 years ago

I think you should close it, thank you.