ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-pdf printing Hebrew text in opposite direction in the generated pdf file #163

Open smijo149 opened 3 years ago

smijo149 commented 3 years ago

The PDF file generated by hocr-pdf has its Hebrew text printed in reverse order.

Steps I followed:

  1. Used Google Cloud Vision to run OCR on the image.
  2. Used gcv2hocr to generate the hOCR file.
  3. Used hocr-pdf --savefile output.pdf actual-file.jpg to generate the PDF file.

The PDF file has the Hebrew text inserted in it, but in reverse order.

Actual image:

[Image: Screen Shot 2021-02-01 at 6 48 35 PM]

This is how my hocr file looks:

[Image: Screen Shot 2021-02-01 at 7 01 04 PM]

Text in the PDF file (I set the text render mode to 0 so that the inserted text is visible):

[Image: Screen Shot 2021-02-01 at 6 48 56 PM]

Hebrew is a right-to-left language, so I am not sure whether I need to pass a language or direction parameter to get this right.

stweil commented 3 years ago

I am afraid that hocr-pdf was never tested with right-to-left (RTL) text. Using bidi, as in https://github.com/tesseract-ocr/tesstrain/blob/master/generate_wordstr_box.py, might fix that.
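
For readers landing here later, a rough sketch of the idea behind the bidi suggestion: OCR output usually stores Hebrew in logical (reading) order, while a PDF writer that lays glyphs out left to right presumably needs them in visual order. The stdlib-only code below is an illustration under that assumption, not the code from hocr-pdf or from the linked script. The names is_rtl_word and to_visual_order are hypothetical, and the per-word reversal is a naive stand-in for the full Unicode Bidirectional Algorithm that a bidi library implements.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl_word(word: str) -> bool:
    """Return True if the word contains at least one strong RTL character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in word)


def to_visual_order(word: str) -> str:
    """Naive logical-to-visual conversion for one word (hypothetical helper).

    Reverses the whole word if it contains any strong RTL character and
    leaves it unchanged otherwise. This is NOT the full bidi algorithm.
    """
    return word[::-1] if is_rtl_word(word) else word


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word in logical order
    # Show the code points before and after the conversion.
    print([hex(ord(ch)) for ch in hebrew])
    print([hex(ord(ch)) for ch in to_visual_order(hebrew)])
```

Words that mix Hebrew with digits or Latin letters need the full algorithm; this sketch would reverse them wholesale, which is why a dedicated bidi library is the safer choice in practice.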

smijo149 commented 3 years ago

Thanks! I will try it out and see if that works for me.

joewiz commented 3 years ago

@smijo149 Looks like you solved this. I wonder if the maintainers of hocr-tools would be interested in your PR?

smijo149 commented 3 years ago

@joewiz Yeah, I was able to solve the issue based on @stweil's suggestion. I have opened PR #165 if anyone is interested. Thanks!