openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr
File not found #92

Closed: tangb closed this issue 6 years ago

tangb commented 6 years ago

Hello

I'm trying to use pyocr with Tesseract 4.0.0 alpha and I get a "file not found" error when the output is generated. It works well with TextBuilder and DigitBuilder but fails with LineBoxBuilder and WordBoxBuilder.

I'm running Debian Jessie (v8).

Can you help me? Thank you :smile:

Tesseract info:

tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1
 Found AVX2
 Found AVX
 Found SSE

Python code

res=tools[0].image_to_string(img, lang='fra', builder=pyocr.builders.WordBoxBuilder())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/pyocr/tesseract.py", line 385, in image_to_string
    " last name tried: %s" % output_file_name)
pyocr.error.TesseractError: (-1, 'Unable to find output file last name tried: /tmp/tess_7wuffwb1/output.hocr')

Content of tmp during process:

/tmp content:
/tmp/tess_7wuffwb1
/tmp/tess_7wuffwb1/input.bmp
/tmp/tess_7wuffwb1/output.txt

jflesch commented 6 years ago

Since it generated an output.txt instead of an output.hocr or output.html, my guess is that you're missing the hocr configuration file (/usr/share/tesseract-ocr/tessdata/configs/hocr with Tesseract 3.05 in Debian). If so, that shouldn't have been silenced, though.
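One quick way to test that guess is to look for a configs/hocr file under each tessdata directory that might be in use. The following is a minimal sketch, not part of pyocr: the helper name `find_hocr_configs` is hypothetical, the two candidate directories are simply the ones mentioned in this thread, and the handling of `TESSDATA_PREFIX` assumes it may point either at the tessdata directory itself or at its parent, depending on the Tesseract version.

```python
import os
from pathlib import Path

# Candidate tessdata directories. Both come from this thread; they are
# examples only, and other installs may use different locations.
CANDIDATE_TESSDATA_DIRS = [
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/local/share/tessdata",
]


def find_hocr_configs(tessdata_dirs):
    """Return the path of every existing <tessdata>/configs/hocr file."""
    found = []
    for directory in tessdata_dirs:
        config = Path(directory) / "configs" / "hocr"
        if config.is_file():
            found.append(str(config))
    return found


if __name__ == "__main__":
    dirs = list(CANDIDATE_TESSDATA_DIRS)
    # Assumption: TESSDATA_PREFIX, if set, may name the tessdata directory
    # itself or its parent, so both interpretations are tried.
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        dirs += [prefix, os.path.join(prefix, "tessdata")]
    matches = find_hocr_configs(dirs)
    if matches:
        print("hocr config found at:", matches)
    else:
        print("no hocr config found in:", dirs)
```

If the file only shows up in a directory that the installed tesseract binary does not read, a mismatch between the build prefix and the tessdata location is a plausible cause.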

tangb commented 6 years ago

I compiled Tesseract myself, and I do have an hocr file at /usr/local/share/tessdata/configs/hocr. I get this output with the --print-parameters command-line option:

tesseract --print-parameters | grep hocr
hocr_font_info  0   Add font info to hocr output
tessedit_create_hocr    0   Write .html hOCR output file

Does 0 mean disabled? Is there a way to make sure Tesseract uses the hocr file mentioned above?

Thank you for your help

jflesch commented 6 years ago

0 = disabled. But this is to be expected, since you didn't tell Tesseract to use the hocr configuration file.

% tesseract --print-parameters | grep hocr
hocr_font_info  0   Add font info to hocr output
tessedit_create_hocr    0   Write .html hOCR output file

% tesseract --print-parameters randomfile.jpeg randomoutputfile hocr | grep hocr
hocr_font_info  0   Add font info to hocr output
tessedit_create_hocr    1   Write .html hOCR output file
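The same check can be scripted from Python. This is only a sketch: `parse_parameters` and `hocr_enabled` are hypothetical helpers, not pyocr functions, and the parser assumes each line is whitespace-separated as name, value, description, like the output above. A parameter with an empty value would be misread, which does not matter for the two hocr parameters shown here.

```python
import subprocess


def parse_parameters(text):
    """Map parameter name -> value (as a string) from --print-parameters output."""
    params = {}
    for line in text.splitlines():
        # Assumed layout: name, value, free-text description.
        fields = line.split(None, 2)
        if len(fields) >= 2:
            params[fields[0]] = fields[1]
    return params


def hocr_enabled(image_path, output_base="randomoutputfile"):
    """Run the second command shown above and report whether
    tessedit_create_hocr is set to 1 when the hocr config is requested."""
    result = subprocess.run(
        ["tesseract", "--print-parameters", image_path, output_base, "hocr"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_parameters(result.stdout).get("tessedit_create_hocr") == "1"
```

Calling `hocr_enabled("randomfile.jpeg")` requires a tesseract binary on the PATH and an existing image; a result of `False` would suggest the hocr config file is not being picked up.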

What is the content of your /usr/local/share/tessdata/configs/hocr ?

tangb commented 6 years ago

I completely rebuilt Tesseract with the latest version and the problem is gone. It seems it was an issue with my tessdata path: the files were installed in a different place...

Thank you for your help ;-)

jflesch commented 6 years ago

You're welcome