Closed (mquarto closed this issue 8 years ago)
Thanks for your interest, Mat.
Can you provide more of the log trace?
It looks like your pdfsandwich is not working when launched inside Alfresco, probably because you used the wizard to install Alfresco and pdfsandwich is picking up a lower version of gs than required.
Thanks, Angel, for the quick reply! A larger extract from the same log file is attached.
What you say is true: I used the wizard to install Alfresco, and pdfsandwich works correctly when launched separately. How can I change the gs version? Sorry to ask...
When pdfsandwich is working properly, the following lines appear in the log:
Checking for gs:
gs -v
GPL Ghostscript 9.10 (2013-08-30)
Your installation is picking up Ghostscript 8.64, and pdfsandwich fails.
You can check the Alfresco documentation on installing additional software; there is a specific section for Ghostscript.
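To see which Ghostscript a given environment resolves, it can help to compare the gs -v banner against a minimum version. The sketch below is only illustrative: the helper names gs_banner_version and version_at_least are made up here, and the 9.10 threshold is an assumption taken from the working log above, not a documented requirement of pdfsandwich.

```shell
#!/bin/sh
# Hypothetical helpers; these names are not part of pdfsandwich or Alfresco.

# Extract the version number from a banner line such as
# "GPL Ghostscript 9.10 (2013-08-30)".
gs_banner_version() {
    printf '%s\n' "$1" | sed -n 's/.*Ghostscript \([0-9][0-9.]*\).*/\1/p'
}

# Succeed when version $1 is at least version $2.
# Assumption: both versions look like MAJOR.MINOR (e.g. 8.64, 9.06, 9.10);
# anything after a second dot is ignored.
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}
    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -lt "$want_major" ] && return 1
    [ "$have_minor" -ge "$want_minor" ]
}
```

Feeding the first line of gs -v into gs_banner_version from inside the same environment that Alfresco uses to launch pdfsandwich shows whether the old 8.64 binary is the one being found.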
On my Linux machine I have /usr/share/ghostscript/9.10/lib installed (ghostscript -version, or gs -v, returns GPL Ghostscript 9.10 (2013-08-30)), so I edited alfresco-global.properties, inserted the line img.gslib = /usr/share/ghostscript/9.10/lib, and restarted the service.
The resulting log file from uploading a PDF file is attached. Still no success: Alfresco seems to load the old gs version. It does not give the same error as before, but I cannot see an OCRed PDF. Is there anything else I'm missing?
Thanks for your help.
gs 8.64 is still selected by Alfresco when launching the pdfsandwich program.
How about creating a shell script that sets PATH before invoking pdfsandwich? You can include your /usr/share/ghostscript/9.10/lib before any other entry and it should work. Remember to change the parameter in alfresco-global.properties to point to this new script instead of pdfsandwich.
My steps to solve your problem:
ocr.sh
export PATH=/usr/bin:$PATH
pdfsandwich $@
chmod +x ocr.sh
alfresco-global.properties
# img.root=/home/alfresco/alfresco-community/common
# img.dyn=${img.root}/lib
# img.exe=${img.root}/bin/convert
img.root=/usr/share/doc/imagemagick
img.exe=/usr/bin/convert
img.config=${img.root}
img.coders=/usr/lib/x86_64-linux-gnu/ImageMagick-6.7.7/modules-Q16/coders
img.dyn=/usr/local/lib
img.gslib=/usr/local/lib
swf.exe=/usr/bin/pdf2swf
# OCR
ocr.command=/home/alfresco/alfresco-community/scripts/ocr.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang spa+eng+fra
ocr.server.os=linux
After these steps the add-on started to work. Maybe it's not very clean, but it can be used in the meantime.
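The wrapper above can be written a little more defensively. This is a sketch, not the add-on's official script: the OCR_BIN variable and the run_ocr function are names invented here, and it assumes the wanted gs lives in /usr/bin, as in the steps above. Quoting "$@" keeps file names that contain spaces intact.

```shell
#!/bin/sh
# Sketch of an ocr.sh wrapper. OCR_BIN and run_ocr are hypothetical names.
# Assumption: the Ghostscript that pdfsandwich should use is in /usr/bin.

OCR_BIN=${OCR_BIN:-pdfsandwich}

run_ocr() {
    # Put /usr/bin first for this call only, then pass every argument
    # through unchanged; "$@" keeps arguments with spaces as single words.
    PATH="/usr/bin:$PATH" "$OCR_BIN" "$@"
}

# Run only when invoked as ocr.sh, so the file can also be sourced for testing.
case ${0##*/} in
    ocr.sh) run_ocr "$@" ;;
esac
```

Saved as ocr.sh and made executable, it is called exactly like the original script, so the ocr.command property does not need to change.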
Same steps here (I think), but a different result (no success):
export PATH=/usr/share/ghostscript/9.10/lib:$PATH
pdfsandwich $@
alfresco-global.properties
img.gslib = /usr/share/ghostscript/9.10/lib
# OCR
ocr.command=/opt/alfresco/scripts/ocr.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang ita
ocr.server.os=linux
The PDF is still not OCRed, and the attached log file (alfresco.log.txt) shows that gs 8.64 is still loaded instead of 9.10.
What am I doing wrong?
You must copy and paste my values: img.gslib should be /usr/local/lib.
OK, done. Problem not solved. An extract from the log file is attached: alfresco.log.txt
You are missing a step or using incorrect values in one of your properties files: if you copy and paste all of the configuration above, it works.
OK, I'll start from scratch and try again. Can you confirm that the script has to be
export PATH=/usr/bin:$PATH
pdfsandwich $@
and not
export PATH=/usr/share/ghostscript/9.10/lib:$PATH
pdfsandwich $@
Yes, I can confirm that.
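This is consistent with how the shell uses PATH: it searches only the listed directories for an executable file with the requested name, so a directory that holds library or resource files rather than the gs binary contributes nothing to the lookup. The function below is a simplified sketch of that search; first_on_path is a made-up name, and it ignores edge cases such as empty PATH entries or directory names containing wildcard characters.

```shell
#!/bin/sh
# first_on_path NAME PATHLIST
# Print the first executable file called NAME found in the colon-separated
# PATHLIST, mimicking (in simplified form) the shell's command lookup.
first_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

Calling first_on_path gs "/usr/share/ghostscript/9.10/lib:$PATH" skips the lib directory unless it happens to contain a gs executable, which would explain why the older gs found later in PATH was still being picked up.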
If you are running into these troubles regarding dependencies between any "external" software being called and Alfresco, and you are not able or willing to install from scratch, you can isolate the call from Alfresco to the external package by using a script like the one mentioned above, BUT spawn a new user session from within the script, one that has the correct initialization for the external software. That way you can check and set up the external package independently of Alfresco and then just call it via the script as you would from a normal command line.
You simply do that by building a script like this:
#!/bin/bash
/bin/su -l -c "/comand2execute -options $@"
This is for the case where you use root as the user. Otherwise you have to specify the user you want to spawn the session for (after /bin/su -l), and if you are not running Alfresco as root you have to make sure that the credentials are specified, encrypted, or not necessary, as you prefer.
Don't forget $@ at the end, because it forwards all parameters coming from the caller (i.e. further options, source file and destination file).
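One detail of the script above may be worth testing before relying on it: inside a longer double-quoted string, "$@" still expands to one word per argument, so su -c could receive only the first part of the command line and treat the rest as extra arguments. The sketch below shows the difference using a made-up count_words helper; joining with "$*" yields a single string, at the cost of losing the original quoting of arguments that contain spaces.

```shell
#!/bin/sh
# count_words is a hypothetical helper that reports how many separate
# words (arguments) it received.
count_words() {
    echo "$#"
}

# "$@" inside a longer quoted string: still one word per caller argument,
# with the prefix glued onto the first one.
words_with_at() {
    count_words "/command2execute -options $@"
}

# "$*" inside a quoted string: everything is joined into a single word.
words_with_star() {
    count_words "/command2execute -options $*"
}
```

Called with the three arguments -o out.pdf in.pdf, words_with_at reports three words and words_with_star reports one, so the "$*" form is the one that hands su -c a single command string.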
I would recommend to specify all your options in this script and NOT in the alfresco-global.properties of this add-on, because
I'm probably one small step ahead now: I can see an OCRed PDF in the alf_data/contentstore subdirectory, and I can see from the log file that Ghostscript 9.10 is loaded, but the OCRed file is not loaded into the Alfresco repository in place of the original one.
(By the way, I have rebuilt everything from scratch and followed your instructions.) alfresco-global.properties.txt
"Failed to copy reader content to writer:" seems to be the issue.
Thanks for your patience. alfresco.log.txt
It is really hard to identify your problem.
I'd suggest trying the jk-ots instructions described in this thread, or installing "ocrmypdf" instead of "pdfsandwich".
I understand, and I don't mean to insist, but what I now read in the log and find in the content folder shows that it's not a conversion or OCR issue (that part works nicely).
It seems to be about writing the OCRed output file to the Alfresco repository; this raises an exception from es.keensoft.alfresco.ocr.OCRExtractAction.
This is the log portion I'm talking about. The files listed in the exception log are in place and correct (original and OCRed):
2016-04-08 15:34:16,206 WARN [es.keensoft.alfresco.ocr.OCRExtractAction] [defaultAsyncAction4] org.alfresco.service.cmr.repository.ContentIOException: 03080008 Failed to copy reader content to writer:
writer: ContentAccessor[ contentUrl=store://2016/4/8/15/34/79f11cf3-abd5-4529-a2d0-51c9bda80a47.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it]
source reader: ContentAccessor[ contentUrl=store://2016/4/8/15/33/2235d95c-8fb5-4a7d-a630-8ac67f39c286.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it_IT]
org.alfresco.service.cmr.repository.ContentIOException: 03080008 Failed to copy reader content to writer:
writer: ContentAccessor[ contentUrl=store://2016/4/8/15/34/79f11cf3-abd5-4529-a2d0-51c9bda80a47.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it]
source reader: ContentAccessor[ contentUrl=store://2016/4/8/15/33/2235d95c-8fb5-4a7d-a630-8ac67f39c286.bin, mimetype=application/pdf, size=144973, encoding=UTF-8, locale=it_IT]
No, sorry, I was wrong: the original and OCRed files are there, but there is a file (re)naming issue.
The Extract OCR rule is trying to read this file:
store://2016/4/8/15/33/2235d95c-8fb5-4a7d-a630-8ac67f39c286.bin
which does not exist; the real source filename is instead
store://2016/4/8/15/33/4e914027-5c78-47e6-8c35-8656c11c62a2.bin
I reckon there is a step within the rule execution where the filename is (or should be) changed, but something goes wrong?
Thanks again
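To check by hand which of the two content URLs actually exists on disk, the store:// URLs from the log can be mapped to files. The sketch below assumes the default file-based content store, where store://<path> corresponds to <contentstore directory>/<path>; the function names are made up, and the content store location must be adjusted to your installation.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting Alfresco log lines.

# Read log text on stdin and print every store:// content URL it mentions.
content_urls() {
    tr ' ,' '\n\n' | sed -n 's/^contentUrl=\(store:\/\/.*\)$/\1/p'
}

# store_url_to_path URL ROOT
# Map a store:// URL to a file path below the content store directory ROOT.
# Assumption: the default file content store layout is in use.
store_url_to_path() {
    printf '%s/%s\n' "$2" "${1#store://}"
}
```

Piping the WARN line through content_urls and testing each mapped path with [ -f ... ] shows which of the reader and writer files is missing.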
Hi, looks like I am running into exactly the same problem. files.zip
I have tried OCRmyPDF as well, but it fails with another error (some lib is missing version info from python3). Let me know if you need details for this as well.
Thank you!
Caused by: org.springframework.dao.ConcurrencyFailureException: Failed to update node 34403
How are you configuring the rule? Is your PDF too short?
It will work if you mark the "Run rule in background" option.
It is marked to run in the background. Source is a normal A4 1-page PDF.
And it works if you unset "background"?
Yes, it does, although adding multiple files gets quite inconvenient this way. Do you have an idea why this depends on whether the rule runs in the background?
It looks like your system is too slow and OCR is trying to update the content during the rendition phase for your new content.
We'll try to provide a new release to cover this case.
This might be correct. The system is a VM with 2 cores / 4GB.
Thanks to mquarto and andreklug for your collaboration. We have finally found the reason for that weird behaviour. Please upgrade your AMP to version 1.0.2 and let me know if it works for you.
Thank you, angelborroy-ks, for updating the repo. Unfortunately, still no success. I have verified that the 1.0.2 repo is installed and rebooted the server (just in case). The log is attached. alfresco.log.zip
It seems that your system is not properly updated.
Are you using this AMP? https://github.com/keensoft/alfresco-simple-ocr/releases/download/1.0.2/simple-ocr-repo.amp
Can you try with a clean Alfresco installation?
Thanks
Hello everyone. I also had trouble configuring simple-ocr, but after a whole day of work and more than a hundred Alfresco restarts, I managed to get this extraordinary add-on working.
Note that my Alfresco Community version is 5.1.e, running on Debian 8 64-bit with 4 vCPUs and 4 GB of RAM.
The installation procedure follows:
apt-get update
apt-get upgrade
apt-get install zlib1g-dev libjpeg-dev libffi-dev ghostscript tesseract-ocr tesseract-ocr-ita qpdf unpaper python3-pip python3-pil python3-pytest python3-reportlab ocaml imagemagick exactimage
pip3 install ocrmypdf
cd /opt
wget http://downloads.sourceforge.net/project/pdfsandwich/pdfsandwich%200.1.4/pdfsandwich_0.1.4_amd64.deb
dpkg -i pdfsandwich_0.1.4_amd64.deb
apt-get -fy install
/opt/alfresco/alfresco.sh stop
cd /opt/alfresco/amps
wget https://github.com/keensoft/alfresco-simple-ocr/releases/download/1.0.2/simple-ocr-repo.amp
apt-get install default-jre
cd /opt/alfresco/bin
java -jar alfresco-mmt.jar install /opt/alfresco/amps/simple-ocr-repo.amp /opt/alfresco/tomcat/webapps/alfresco.war -verbose
/opt/alfresco/bin/apply_amps.sh
nano /opt/alfresco/scripts/ocr.sh
export PATH=/usr/bin:$PATH
pdfsandwich $@
chmod +x /opt/alfresco/scripts/ocr.sh
nano /opt/alfresco/tomcat/shared/classes/alfresco-global.properties
img.root=/usr/share/doc/imagemagick
img.exe=/usr/bin/convert
img.config=${img.root}
img.coders=/usr/lib/x86_64-linux-gnu/ImageMagick-6.8.9/modules-Q16/coders
img.dyn=/usr/share/ghostscript/9.06/lib
img.gslib=/usr/share/ghostscript/9.06/lib
ocr.command=/opt/alfresco/scripts/ocr.sh
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang ita+eng
ocr.server.os=linux
/opt/alfresco/alfresco.sh start
Create a folder in the root of the "My Files" on which to send the files to be processed by OCR, for example: Scanner
Create a rule on the Scanner folder.
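As a sanity check on the procedure above, it may be worth confirming that the installed tools are actually reachable on PATH. A minimal sketch follows; missing_tools is a made-up helper, and the tool names in the usage note are an assumption based on the packages in the procedure.

```shell
#!/bin/sh
# missing_tools NAME...
# Print the name of every given command that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, missing_tools gs tesseract pdfsandwich convert qpdf unpaper lists whichever of those commands the current environment cannot find.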
Thanks for your contribution massimone73.
Can you confirm whether both synchronous and asynchronous rules work with the 1.0.2 release?
On another note, would it be useful to package your script as a Dockerfile for others?
Thanks
I'm sorry, I do not know how to check. Can you please show me how to verify what you are referring to?
When you define a new rule, you can check or uncheck the option "Run in background".
When it is checked, the action is executed after you upload or change the document. When it is unchecked, OCR is launched during the upload itself, before it completes.
This issue started out discussing problems with the Alfresco installation, but the latest comments solved a new issue: the add-on didn't work in the background under some circumstances.
Thanks again.
I confirm. Both methods work perfectly. After enabling the 'Run in background' option, I uploaded a .tif file and the upload was a breeze. After about 30 seconds I refreshed the page and the same document also appeared as a PDF. Thank you.
Could you explain how to index all, or nearly all, of the words of a document processed with simple-ocr? After I upload a .tif file, simple-ocr generates another file in PDF format. If I open the PDF file produced by OCR, I can highlight and copy almost all the words of the document correctly, but if I use these words to search for the document with Alfresco's search function, only some of them return the PDF file.
It should work by default. Try uploading the OCRed PDF file to another folder: does SOLR find the contents of this new file? If not, there may be an index problem to solve.
Solved. The problem was due to the quality of the document and the scanner. Today I tried with the multifunction installed in my company and indexing is perfect. Thanks.
Thank you, angelborroy-ks, for the 1.0.2 update, and massimone73 for the detailed instructions. It now works perfectly and it's a very useful add-on!
Thanks to everyone.
Should we close this issue, or is something still unsolved?
I think you should close it, thank you.
Hello, this is a very nice tool to have, but I cannot seem to make it work. It produces the OCR file (through pdfsandwich in my case) but cannot write the final result and place it in the Alfresco repository. Below is an extract from the alfresco.log file; maybe you can help? Thanks in advance, Mat
2016-04-07 17:08:56,234 WARN [es.keensoft.alfresco.ocr.OCRExtractAction] [defaultAsyncAction3] org.alfresco.service.cmr.repository.ContentIOException: 03070084 Failed to copy reader content to writer:
writer: ContentAccessor[ contentUrl=store://2016/4/7/17/8/da25d86b-dbc1-4b95-ba10-4d2daaef7b17.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=it]
source reader: ContentAccessor[ contentUrl=store://2016/4/7/17/6/4a37ad34-b135-49cb-b316-9c7386eaf5bd.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=en_US]
org.alfresco.service.cmr.repository.ContentIOException: 03070084 Failed to copy reader content to writer:
writer: ContentAccessor[ contentUrl=store://2016/4/7/17/8/da25d86b-dbc1-4b95-ba10-4d2daaef7b17.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=it]
source reader: ContentAccessor[ contentUrl=store://2016/4/7/17/6/4a37ad34-b135-49cb-b316-9c7386eaf5bd.bin, mimetype=application/pdf, size=994256, encoding=UTF-8, locale=en_US]