ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Line images using HOCR #160

Open rraina97 opened 4 years ago

rraina97 commented 4 years ago

I want to create training data (line images and corresponding .txt files) for Arabic-script languages from Arabic documents. I ran Tesseract with hOCR output to create an hOCR file and then used hocr-extract-images to get the line data. But the hOCR file created is of very low accuracy (maybe it depends on the already trained Tesseract model). Is there any other method to create line images that can be used to train Tesseract and thus improve its accuracy?

zvezdochiot commented 4 years ago

@rraina97, go to https://github.com/tesseract-ocr/tesseract!

kba commented 4 years ago

"But the hOCR file created is of very low accuracy (maybe it depends on the already trained Tesseract model)"

IIUC you are not happy with the line segmentation? In that case you should indeed investigate Tesseract and its documentation. There are also other tools for line segmentation, such as ocropy, sbb_textline_detection, and several implementations in OCR-D.

Once you have an hOCR file with the right segmentation, we can support you with the right hocr-tools invocation.
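
To check whether the segmentation in an hOCR file is usable before extracting images, it can help to list the line bounding boxes it contains. The following is a minimal sketch, not part of hocr-tools: it assumes lines are marked with the class `ocr_line` and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute, as in typical Tesseract hOCR output. The names `LineBoxParser`, `line_boxes`, and `SAMPLE` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class LineBoxParser(HTMLParser):
    """Collects the bounding box of every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" not in classes:
            return  # skip pages, paragraphs, words, etc.
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def line_boxes(hocr_text):
    """Return a list of (x0, y0, x1, y1) tuples, one per text line."""
    parser = LineBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


# Hypothetical hOCR fragment, used only to exercise the parser.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 640 480">
  <span class="ocr_line" title="bbox 36 92 580 122; baseline 0 -6">
    <span class="ocrx_word" title="bbox 36 92 120 122">first</span>
  </span>
  <span class="ocr_line" title="bbox 36 130 560 160; baseline 0 -5">
    <span class="ocrx_word" title="bbox 36 130 140 160">second</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for box in line_boxes(SAMPLE):
        print(box)
```

Each tuple is in (left, upper, right, lower) order, so it could be passed to an image library's crop function (for example Pillow's `Image.crop`) to inspect the line regions by eye. If the boxes look wrong, the problem is in the segmentation step and not in the extraction.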