ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Bilingual Text Encoding is not Working for Kannada-English Output Hocr File #176

Open vaibhavsanil opened 2 years ago

vaibhavsanil commented 2 years ago

I am facing an issue with hocr-pdf conversion: bilingual English-Kannada text is not encoded correctly in the text layer of the output PDF.

I have an image in the Kannada language: https://drive.google.com/file/d/11P2XMFWjmc0S6rzfOX58UtZZJkG2StNI/view?usp=sharing

The corresponding hOCR output for that image: https://drive.google.com/file/d/1wm-40rCN_rSE4cqT499kZAjAs5y6A3xl/view?usp=sharing

The Google Cloud Vision (GCV) OCR output for the same file is attached as JSON (OCR Output in JSON).

The output of the hocr-pdf conversion is attached as well (Hocr-PDF output).

As you can see, searching the generated PDF for English words highlights them correctly, but searching for Kannada text returns gibberish.
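To narrow down whether the problem is in the hOCR file or in the PDF step, a quick check like the following could confirm that the Kannada words in the hOCR are real Kannada code points (Unicode block U+0C80 to U+0CFF). This is only a minimal sketch using the Python standard library. It assumes the words are marked with the `ocrx_word` class and contain no unclosed child tags; `WordCollector`, `script_of`, and `summarize` are hypothetical helper names, not part of hocr-tools.

```python
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Collects the text of every element whose class includes 'ocrx_word'.

    Assumption: word elements do not contain unclosed (void) child tags.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the current word element
        self.words = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "ocrx_word" in classes:
            if self.depth == 0:
                self.words.append("")  # start a new word
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.words[-1] += data


def script_of(word):
    """Rough script guess: Kannada block first, then ASCII letters."""
    if any("\u0c80" <= ch <= "\u0cff" for ch in word):
        return "kannada"
    if any(ch.isascii() and ch.isalpha() for ch in word):
        return "latin"
    return "other"


def summarize(hocr_text):
    """Counts hOCR words per script."""
    parser = WordCollector()
    parser.feed(hocr_text)
    parser.close()
    counts = {"kannada": 0, "latin": 0, "other": 0}
    for word in parser.words:
        counts[script_of(word.strip())] += 1
    return counts
```

Running `summarize(open("page.hocr", encoding="utf-8").read())` (the file name is a placeholder) and seeing a non-zero `kannada` count would suggest the hOCR itself is encoded correctly, which would point at the PDF text layer or its font as the place where the Kannada text gets lost.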

Any guidance in this regard is appreciated.