corrupted data when generating a searchable pdf with hocr-pdf

pprw commented 4 days ago

I am trying to generate a searchable PDF from a JPEG file and an hOCR file using hocr-pdf.

Both files are in the same folder. Running hocr-pdf . > out.pdf generates a PDF, but I cannot search its text.
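One thing worth ruling out first is a naming mismatch between the image and the hOCR file. The sketch below assumes that hocr-pdf looks for *.jpg files in the given folder and expects an .hocr file with the same base name next to each one; that pairing rule is an assumption here and should be checked against the installed hocr-tools version. The helper name unpaired_images is hypothetical.

```python
from pathlib import Path


def unpaired_images(folder):
    """Return the names of .jpg files in `folder` that have no same-named .hocr file.

    Assumption: hocr-pdf pairs page.jpg with page.hocr in the same folder.
    """
    folder = Path(folder)
    return sorted(
        image.name
        for image in folder.glob("*.jpg")
        if not image.with_suffix(".hocr").exists()
    )


if __name__ == "__main__":
    # Check the current directory, as in `hocr-pdf . > out.pdf`.
    for name in unpaired_images("."):
        print(f"no matching .hocr file for {name}")
```

An empty result only means the file names line up; it says nothing about whether the hOCR content itself is usable.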

The PDF reader (evince) reports that "some font thing failed" when displaying the file, although I can see the image.

When I extract the text from the PDF:

$ pdf2txt out.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

out.txt then contains (excerpt):

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)
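Output made up only of (cid:0) tokens suggests that the extractor found glyph codes but could not map them back to characters. As a rough way to quantify this for a file such as out.txt, the following sketch counts the (cid:N) tokens and the readable characters left over. The function name cid_summary is hypothetical and not part of pdfminer.

```python
import re

# Matches the placeholder tokens seen in the extracted text, e.g. "(cid:0)".
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def cid_summary(text):
    """Count (cid:N) placeholders and the non-whitespace characters that remain."""
    cids = [int(number) for number in CID_TOKEN.findall(text)]
    remaining = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remaining if not ch.isspace())
    return {
        "cid_tokens": len(cids),
        "distinct_cids": sorted(set(cids)),
        "readable_chars": readable,
    }


if __name__ == "__main__":
    with open("out.txt", encoding="utf-8") as handle:
        print(cid_summary(handle.read()))
```

If readable_chars is zero and the only distinct value is 0, no usable text came out of the PDF at all.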

My hOCR file is generated by kraken.

The kraken documentation says:

hOCR output is slightly different from hOCR files produced by ocropus. Each ocr_line span contains not only the bounding box of the line but also character boxes (x_bboxes attribute) indicating the coordinates of each character. In each line alternating sequences of alphanumeric and non-alphanumeric (in the unicode sense) characters are put into ocrx_word spans. Both have bounding boxes as attributes and the recognition confidence for each character in the x_conf attribute.

Paragraph detection has been removed as it was deemed to be unduly dependent on certain typographic features which may not be valid for your input.

So I also tried with an ALTO file (also generated by kraken), which I converted to hOCR format with ocr-fileformat. The result was the same.
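To compare the kraken output with the converted ALTO output, it may help to list which hOCR element classes each file contains and whether they carry a bbox property. The sketch below uses only the Python standard library. The class and function names are hypothetical, and it makes no claim about which elements hocr-pdf actually reads.

```python
from collections import Counter
from html.parser import HTMLParser


def has_bbox(title):
    """True if the hOCR title attribute has a property named exactly 'bbox'."""
    # Properties are separated by ';'. x_bboxes must not count as bbox.
    return any(part.split()[:1] == ["bbox"] for part in title.split(";"))


class HocrClassCounter(HTMLParser):
    """Counts elements per ocr*/ocrx* class and those lacking a bbox."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.without_bbox = Counter()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        title = attributes.get("title") or ""
        for css_class in (attributes.get("class") or "").split():
            if css_class.startswith("ocr"):
                self.counts[css_class] += 1
                if not has_bbox(title):
                    self.without_bbox[css_class] += 1


def summarize_hocr(markup):
    """Return (counts, without_bbox) for a string of hOCR markup."""
    parser = HocrClassCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts, parser.without_bbox
```

Running summarize_hocr on both files and comparing the counts for ocr_line and ocrx_word would show whether the two inputs differ structurally.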

stefan6419846 commented 4 days ago

Which version of reportlab are you using? As far as I am aware, reportlab>=4.1.0 breaks hocr-pdf.
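A quick way to check the installed version against that threshold is sketched below. The 4.1.0 cut-off is taken from the comment above and has not been verified independently; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Turn '4.2.2' (or '4.1', '4.2.2rc1') into a comparable (major, minor, patch)."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_reported_broken(version_string):
    """True for reportlab versions at or above 4.1.0, the range reported as problematic."""
    return release_tuple(version_string) >= (4, 1, 0)


def installed_reportlab_status():
    """Return (version, reported_broken), or None if reportlab is not installed."""
    try:
        installed = version("reportlab")
    except PackageNotFoundError:
        return None
    return installed, is_reported_broken(installed)


if __name__ == "__main__":
    print(installed_reportlab_status())
```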

pprw commented 2 days ago

Thanks for the information.

I was using reportlab 4.2.2. I downgraded to 4.0.9.

Now I no longer get the WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data message,

but I still cannot search inside the PDF, and pdf2txt creates a file filled with:

ocropus / hocr-tools

corrupted data when generating a searchable pdf with hocr-pdf #186