tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

FileNotFoundError while using image_to_pdf_or_hocr #2356

Closed nathan30 closed 5 years ago

nathan30 commented 5 years ago

Environment

Current Behavior:

The image I pass to pytesseract.image_to_pdf_or_hocr has a path like /home/edissyum/opencapture/data/tmp/tmp-0.jpg, but I get this error:

Traceback (most recent call last):
  File "src/main.py", line 80, in <module>
    process(args, path, Log, Separator, Config, Image, Ocr, Locale, WebService)
  File "/home/edissyum/opencapture/src/process/OCForMaarch.py", line 61, in process
    Ocr.generate_searchable_pdf(file, Image, Config)
  File "/home/edissyum/opencapture/src/classes/PyTesseract.py", line 43, in generate_searchable_pdf
    extension='pdf'
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 325, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
  File "/usr/local/lib/python3.5/dist-packages/pytesseract/pytesseract.py", line 220, in run_and_get_output
    with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_yy447kdg_out.pdf'
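
The traceback shows pytesseract failing to open an output file it expected to find in the temp directory, which suggests that no PDF was written there. A minimal sketch of a pre-flight check, assuming the cause is on the filesystem side (unreadable input or unwritable temp directory), which this thread does not confirm. `preflight_check` is a hypothetical helper, not part of pytesseract:

```python
import os
import tempfile


def preflight_check(image_path, tmp_dir=None):
    """Return a list of problems that could prevent OCR output from being written.

    Hypothetical helper: it only inspects the filesystem, not Tesseract itself.
    """
    problems = []
    tmp_dir = tmp_dir or tempfile.gettempdir()

    # The input image must exist and be readable by the current user.
    if not os.path.isfile(image_path):
        problems.append("input image not found: %s" % image_path)
    elif not os.access(image_path, os.R_OK):
        problems.append("input image not readable: %s" % image_path)

    # The temp directory is where the intermediate output file is expected.
    if not os.path.isdir(tmp_dir):
        problems.append("temp directory missing: %s" % tmp_dir)
    elif not os.access(tmp_dir, os.W_OK):
        problems.append("temp directory not writable: %s" % tmp_dir)

    return problems
```

Logging the result of `preflight_check(path)` just before the OCR call helps narrow down whether the failure is on the filesystem side or inside the OCR run itself. An empty list means none of these checks failed.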

Expected Behavior:

Return the searchable PDF file

Suggested Fix:

?

zdenop commented 5 years ago

Please respect the guidelines for posting issues: we do not provide support for 3rd party projects.

NicoLivesey commented 5 years ago

Hi @nathan30, did you manage to fix this issue?

nathan30 commented 5 years ago

Hi,

Yes. It's hard to remember the solution I found, but I think it was related to ImageMagick policies.

NicoLivesey commented 5 years ago

Hi, thanks for your quick reply. Sorry, I am a bit late on this issue, haha. What did you do with ImageMagick to fix this? Thanks a lot.

nathan30 commented 5 years ago

Edit this file (X is your ImageMagick major version): /etc/ImageMagick-X/policy.xml

First, try commenting out all of these lines:

<policy domain="resource" name="memory" value="256MiB"/>
<policy domain="resource" name="map" value="512MiB"/>
<policy domain="resource" name="width" value="16KP"/>
<policy domain="resource" name="height" value="16KP"/>
<policy domain="resource" name="area" value="128MB"/>
<policy domain="resource" name="disk" value="1GiB"/>

Then retry your script. If the error disappears, uncomment the lines one by one until you find the one that causes the issue, then increase its value. Hope I remember correctly, haha.
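
To keep track of which limits are active while commenting lines in and out, the policy file can be inspected from Python. A minimal sketch using only the standard library; `list_resource_policies` and the abbreviated `SAMPLE` document are hypothetical, and it assumes the file's root element is `<policymap>`, as in stock ImageMagick installs:

```python
import xml.etree.ElementTree as ET


def list_resource_policies(policy_xml_text):
    """Return {name: value} for every active <policy domain="resource"> entry.

    Commented-out entries are skipped because the parser drops XML comments.
    """
    root = ET.fromstring(policy_xml_text)
    limits = {}
    for policy in root.iter("policy"):
        if policy.get("domain") == "resource":
            limits[policy.get("name")] = policy.get("value")
    return limits


# Abbreviated sample; the "map" limit is commented out, so it is not reported.
SAMPLE = """<policymap>
  <policy domain="resource" name="memory" value="256MiB"/>
  <!-- <policy domain="resource" name="map" value="512MiB"/> -->
  <policy domain="resource" name="disk" value="1GiB"/>
</policymap>"""

print(list_resource_policies(SAMPLE))
```

For the real file, pass in the text read from /etc/ImageMagick-X/policy.xml. Comparing the output before and after each change shows which limit was in effect when the error came back.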

NicoLivesey commented 5 years ago

OK, that's very clear. I will try that. Thank you very much for your help!

nasheedyasin commented 5 years ago

Hi, any updates on whether it worked? I am facing the same issue.

prameshbajra commented 3 years ago

@nasheedyasin Did you find a solution? I am using image_to_pdf_or_hocr in an AWS Lambda function and it is giving me the same error.

nasheedyasin commented 3 years ago

I don't quite recall, but I'll tell you this: I soon found out that I was sometimes using images that were altogether hard to OCR because of their size. Commenting out those lines in ImageMagick's XML configuration file seems to remove the defined resource limits. While that would be okay for experimentation, it is not a good idea if you intend to put your solution into production.
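
One way to leave the limits in place is to check the image size before OCR and scale oversized inputs down instead of lifting the caps. A minimal sketch, assuming the caps are expressed in pixels. The function names and the cap values passed in are hypothetical, and how ImageMagick interprets units such as 16KP should be checked against its own documentation:

```python
def fits_limits(width, height, max_width, max_height, max_pixels):
    """Return True if an image of the given pixel size stays within the caps."""
    return (
        width <= max_width
        and height <= max_height
        and width * height <= max_pixels
    )


def scale_to_fit(width, height, max_width, max_height, max_pixels):
    """Return (new_width, new_height), scaled down uniformly to satisfy the caps."""
    # Largest uniform scale factor allowed by each constraint, never above 1.
    scale = min(
        1.0,
        max_width / width,
        max_height / height,
        (max_pixels / (width * height)) ** 0.5,
    )
    # Truncate so the result never exceeds a cap; keep at least one pixel.
    return max(1, int(width * scale)), max(1, int(height * scale))
```

The returned size can then be handed to whatever image library the pipeline already uses for resizing. Downscaling may reduce OCR accuracy on small text, so the caps are a trade-off to tune rather than fixed values.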