ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-pdf does not work with tesseract #161

Open vstepaniuk opened 4 years ago

vstepaniuk commented 4 years ago

When I execute this:

$ tesseract img.png img hocr
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
$ hocr-pdf . > new.pdf
/usr/local/bin/hocr-pdf:134: DeprecationWarning: decodestring() is a deprecated alias since Python 3.1, use decodebytes()
  uncompressed = bytearray(zlib.decompress(base64.decodestring(font)))

I get a corrupt PDF file, and evince says "The document contains no pages". See sample.zip for the files themselves.

Tesseract 4.1.1, Python 3.8.2, hocr-tools 1.1.1

maltaisn commented 3 years ago

Only JPEG images are supported, not PNG.
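
A quick way to check for this before running hocr-pdf, as a rough sketch. It assumes each .hocr file needs a same-named .jpg next to it, which is a reading of the comment above and has not been checked against the hocr-pdf source. hocr_without_jpeg is a hypothetical helper name.

```python
from pathlib import Path


def hocr_without_jpeg(directory):
    """Return names of .hocr files that have no same-named .jpg beside them.

    Assumption: hocr-pdf expects <name>.jpg next to <name>.hocr.
    """
    d = Path(directory)
    return sorted(
        p.name for p in d.glob("*.hocr")
        if not p.with_suffix(".jpg").exists()
    )


if __name__ == "__main__":
    # Report any hOCR file in the current directory lacking a JPEG partner.
    for name in hocr_without_jpeg("."):
        print(f"{name}: no matching .jpg found")
```

With the files from the original report (img.png and img.hocr only), this would list img.hocr.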

joewiz commented 3 years ago

I get a similar error using the supplied sample.zip, even after converting Tesseract's PNGs to JPG.

Before (with just PNGs):

% hocr-pdf . > new.pdf
Traceback (most recent call last):
  File "/usr/local/bin/hocr-pdf", line 143, in <module>
    export_pdf(args.imgdir, 300)
  File "/usr/local/bin/hocr-pdf", line 51, in export_pdf
    load_invisible_font()
  File "/usr/local/bin/hocr-pdf", line 134, in load_invisible_font
    uncompressed = bytearray(zlib.decompress(base64.decodestring(font)))
AttributeError: module 'base64' has no attribute 'decodestring'

After converting the PNGs to JPG (using ImageMagick, i.e., convert new.png new.jpg):

% hocr-pdf . > new.pdf
Traceback (most recent call last):
  File "/usr/local/bin/hocr-pdf", line 143, in <module>
    export_pdf(args.imgdir, 300)
  File "/usr/local/bin/hocr-pdf", line 51, in export_pdf
    load_invisible_font()
  File "/usr/local/bin/hocr-pdf", line 134, in load_invisible_font
    uncompressed = bytearray(zlib.decompress(base64.decodestring(font)))
AttributeError: module 'base64' has no attribute 'decodestring'

Tesseract v5.0.0-alpha-20210401 with Leptonica (via macOS Homebrew, brew install tesseract --HEAD), Python 3.9.2, hocr-tools 1.1.1
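
Both tracebacks stop in load_invisible_font(), before any image is read, so the PNG/JPG question does not come into play on Python 3.9. base64.decodestring() was a deprecated alias of base64.decodebytes() (as the warning in the first post says) and was removed in Python 3.9. Below is a minimal sketch of the replacement call. It assumes the embedded font is a bytes object. decode_font and the stand-in payload are hypothetical, not the actual hocr-pdf code.

```python
import base64
import zlib


def decode_font(font_b64: bytes) -> bytearray:
    """Decode a base64-encoded, zlib-compressed blob.

    Same expression as line 134 of the traceback, but calling
    decodebytes() (available since Python 3.1) instead of the
    removed decodestring() alias.
    """
    return bytearray(zlib.decompress(base64.decodebytes(font_b64)))


# Round trip with stand-in data (not the real embedded font).
payload = b"stand-in font bytes"
encoded = base64.encodebytes(zlib.compress(payload))
assert decode_font(encoded) == payload
```

Changing decodestring to decodebytes on that one line of the installed script should get past the AttributeError. Whether the resulting PDF is then valid is a separate question.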