Open pprw opened 4 days ago
Which version of reportlab are you using? As far as I am aware, reportlab>=4.1.0 breaks hocr-pdf.
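A quick way to answer the version question is to ask the active Python environment directly. The sketch below is only illustrative: the 4.1.0 threshold comes from the comment above and is not independently confirmed, and the helper names (`parse_version`, `is_suspect`, `installed_reportlab_version`) are made up for this example rather than taken from hocr-tools or reportlab.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.2.2' into (4, 2, 2); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad to three components so that '4.1' compares equal to '4.1.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_suspect(version_text, first_bad=(4, 1, 0)):
    """True if the version is at or above the release reported as problematic.

    The default threshold is an assumption taken from the comment above.
    """
    return parse_version(version_text) >= first_bad


def installed_reportlab_version():
    """Return the installed reportlab version string, or None if absent."""
    try:
        return metadata.version("reportlab")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_reportlab_version()
    if version is None:
        print("reportlab is not installed in this environment")
    elif is_suspect(version):
        print(f"reportlab {version}: at or above 4.1.0, reported to break hocr-pdf")
    else:
        print(f"reportlab {version}: below 4.1.0")
```

Run it with the same interpreter that runs hocr-pdf; otherwise it may report the version from a different environment.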
Thanks for the information.
I was using reportlab 4.2.2. I downgraded to 4.0.9.
Now I no longer get the warning
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data
but I still cannot search inside the PDF, and pdf2text creates a file filled with:
I am trying to generate a searchable PDF from a JPEG file and an hOCR file using hocr-pdf.
I have both files in the same folder.
hocr-pdf . > out.pdf
generates a PDF, but I cannot search inside it. The PDF reader (evince) says "some font thing failed" when displaying the file (I can see the image).
When I extract the text from the pdf
and out.txt contains (excerpt)
My hocr file is generated by kraken.
I read from kraken documentation
So I also tried with an ALTO file (also generated by kraken), which I converted to hOCR format with ocr-fileformat. The result was the same.