symptom:
The PDF produced by hocr2pdf does not contain all of the text from tesseract's output. When I open the hocr2pdf output in a PDF viewer and search for a word, the matches are highlighted in the wrong positions, and many words are not searchable at all. It looks as if either the font sizes make the text overlap or the text positions are wrong. This appears to be a bug in hocr2pdf: it cannot handle the word-by-word hOCR format.
hocr2pdf seems to handle only hOCR files that specify the location of each character. It does not seem to handle the newer hOCR format that specifies the location of each word. tesseract, however, generates hOCR files that specify the locations of words, not characters.
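To make the word-level format concrete, here is a minimal Python sketch that pulls word boxes out of an hOCR fragment. The sample markup is a hand-written illustration in the style of tesseract's output (one `ocrx_word` span per word, with a `bbox` in its `title`), not captured from a real run, and the names `WordBoxParser` and `word_boxes` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hand-written sample: each word has its own span and bounding box,
# and there is no per-character position information.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 210 116'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 110 92 210 116'>world</span>
</span>
"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())

    def handle_data(self, data):
        text = data.strip()
        if self._bbox is not None and text:
            self.words.append((text, self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None  # an empty word span must not leak its bbox


def word_boxes(hocr_text):
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

A converter that expects one box per character has nothing to work with in such a file; the finest granularity available is the word box.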
fix:
The code below works around this bug: it lets pdfocr do the job without hocr2pdf by using tesseract's ability to produce a sandwich PDF.
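As a rough illustration of the idea only (this is not the actual pdfocr change), the Python sketch below builds and runs a tesseract command that uses the `pdf` config file to write a searchable PDF directly. The function names, the default language `eng`, and the file names in the usage note are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def sandwich_pdf_command(image_path, output_base, lang="eng"):
    """Build the tesseract command line that writes <output_base>.pdf."""
    # The trailing "pdf" names tesseract's PDF config file, which makes
    # tesseract emit a PDF with the page image and a hidden text layer.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def make_sandwich_pdf(image_path, output_base, lang="eng"):
    """Run tesseract and return the path of the PDF it should have written."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(sandwich_pdf_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.pdf")
```

For example, `make_sandwich_pdf("page.tif", "page")` would be expected to leave `page.pdf` next to the input, with no hOCR file or hocr2pdf step involved.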
tested versions: hocr2pdf v0.8.9, tesseract v3.03, pdfocr v0.1.4
related bug report: https://bugs.launchpad.net/cuneiform-linux/+bug/623438