gkovacs / pdfocr

Adds text to PDF files using the cuneiform OCR software
MIT License
325 stars 49 forks

bypassing hocr2pdf that can't handle the new hocr format #12

Closed snowboard975 closed 10 years ago

snowboard975 commented 10 years ago

tested versions: hocr2pdf v0.8.9, tesseract v3.03, pdfocr v0.1.4

symptom: The PDF produced by hocr2pdf does not contain all of the text from tesseract's output. When I open the hocr2pdf output in a PDF viewer and search for a word, the matches are highlighted in the wrong positions, and many words are not searchable at all. It looks as if the font sizes make the text overlap, or the text positions are wrong. This appears to be a bug in hocr2pdf: it cannot handle the word-by-word hOCR format.

https://bugs.launchpad.net/cuneiform-linux/+bug/623438

hocr2pdf seems to handle only hOCR files that specify the location of each character. It does not seem to handle the newer hOCR format that specifies the location of each word. tesseract, however, generates hOCR files that specify the locations of words, not characters.
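
To make the word-level format concrete, here is a minimal Python sketch that pulls the per-word bounding boxes out of an hOCR fragment. It is not part of pdfocr, hocr2pdf, or tesseract. The sample markup is hand-written in the style of tesseract's hOCR output (`ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute), and the names `WordBoxParser`, `parse_bbox`, and `word_boxes` are made up for illustration.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # The title holds ';'-separated properties, e.g. "bbox 36 92 96 116; x_wconf 90".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span that has a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []      # finished (text, bbox) pairs
        self._bbox = None    # bbox of the word span currently open, if any
        self._text = []      # text pieces seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def word_boxes(hocr):
    """Parse an hOCR string and return its word-level boxes."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample (an assumption, not real tesseract output).
SAMPLE = """
<span class='ocr_line' id='line_1' title='bbox 36 92 580 122'>
  <span class='ocrx_word' id='word_1' title='bbox 36 92 96 116'>Hello</span>
  <span class='ocrx_word' id='word_2' title='bbox 109 92 201 116'>world</span>
</span>
"""

if __name__ == "__main__":
    for text, bbox in word_boxes(SAMPLE):
        print(text, bbox)
```

Each word carries a single box, with no per-character positions. A converter that expects one box per character therefore has nothing to place the individual glyphs with, which would be consistent with the misplaced and unsearchable text described above.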

fix: The code below works around this bug and lets pdfocr do the job without hocr2pdf, using tesseract's own ability to produce a sandwich PDF.