ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Error while using hocr-pdf file #121

Closed shekarnode closed 6 years ago

shekarnode commented 6 years ago

When I run the command below, I get a character-encoding error. Please help.

hocr-pdf . > out.pdf
Traceback (most recent call last):
  File "C:\Python36\Scripts\hocr-pdf.py", line 143, in <module>
    export_pdf(args.imgdir, 300)
  File "C:\Python36\Scripts\hocr-pdf.py", line 70, in export_pdf
    pdf.save()
  File "c:\python36\lib\site-packages\reportlab\pdfgen\canvas.py", line 1237, in save
    self._doc.SaveToFile(self._filename, self)
  File "c:\python36\lib\site-packages\reportlab\pdfbase\pdfdoc.py", line 224, in SaveToFile
    f.write(data)
  File "C:\Python36\Scripts\hocr-pdf.py", line 47, in write
    sys.stdout.write(data)
  File "c:\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-14: character maps to <undefined>

stweil commented 6 years ago

Can you provide a hOCR file which causes this error? How did you create it?

shekarnode commented 6 years ago

I used Tesseract 4.0.0 to generate the hOCR file: Hocr File

This is the image from which the above hOCR was generated: e3_out

shekarnode commented 6 years ago

Is there any other way to extract a table from hOCR data?

zuphilip commented 6 years ago

This works for me as well, after I renamed the image and converted it to a JPG file.

  1. Do you have the jpg file also in your directory?
  2. What is your environment? Linux or Windows?
  3. Which Python version do you use (python -V)?
  4. Which encoding does Python use in your shell?

shekarnode commented 6 years ago

@zuphilip

  1. I was using a PNG image for the conversion; I have now replaced it with a JPG.
  2. Environment: Windows
  3. Python 3.6.4
  4. I was using cmd to get the output. When I tried Git Bash instead, I got a PDF as output, but it was just a normal PDF, i.e. not searchable.

Are you able to generate a searchable PDF?

amitdo commented 6 years ago

Tesseract has an option to output PDF directly. Did you try it?

zuphilip commented 6 years ago

> Are you able to generate a searchable PDF?

Yes, I see a searchable PDF, but I am working on Linux.

On a Windows terminal the encoding can be a problem. You can check which encoding Python uses there by starting python and typing:

>>> import sys
>>> sys.stdout.encoding

If that is not UTF-8, then you can try to run the command with PYTHONIOENCODING=UTF-8 in front, i.e.

PYTHONIOENCODING=UTF-8 hocr-pdf . > out.pdf
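
As a rough illustration of why the console encoding matters, here is a minimal sketch. The helper name `can_encode` and the sample character are made up for this example and are not taken from the failing hOCR file. It shows that cp1252 has no mapping for some characters that UTF-8 encodes without trouble, which is the kind of situation that produces "character maps to <undefined>". Note also that the `VAR=value command` prefix above is a bash form; in cmd the variable would have to be set beforehand with `set PYTHONIOENCODING=UTF-8`.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text has a mapping in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample: Greek capital delta, which cp1252 does not contain.
sample = "\u0394"

# The encoding Python currently uses when writing text to stdout.
print("stdout encoding:", sys.stdout.encoding)
print("cp1252 can encode sample:", can_encode(sample, "cp1252"))
print("utf-8 can encode sample:", can_encode(sample, "utf-8"))
```

If the first line printed on your machine is cp1252, that matches the codec named in the traceback.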

> I got a PDF as output, but it was just a normal PDF, i.e. not searchable.

This is with Git Bash on Windows, right? Can you upload your result here?

shekarnode commented 6 years ago

@zuphilip out.pdf is the PDF file that was generated.

@amitdo I have also tried generating a searchable PDF with Tesseract, using the commands provided over here. The output is still not searchable; it is just a simple PDF containing the image.

zuphilip commented 6 years ago

@shekarnode There is text in your generated PDF and I can search for text as well.

shekarnode commented 6 years ago

I was using Adobe Reader the whole time and was not able to search. Now that I have opened the PDF in a browser, I found that it is searchable.

Thanks @zuphilip for helping out.

amitdo commented 6 years ago

The pdf produced by Tesseract is also searchable.