ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

corrupted data when generating a searchable pdf with hocr-pdf #186

Open pprw opened 4 months ago

pprw commented 4 months ago

I am trying to generate a searchable pdf from a jpeg file and a hocr file with the help of hocr-pdf.

Both files are in the same folder. Running hocr-pdf . > out.pdf generates a PDF, but I cannot search inside it.

The PDF reader (Evince) reports "some font thing failed" when displaying the file (I can see the image).

When I extract the text from the PDF, I get:

$ pdf2txt out.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

and out.txt contains (excerpt)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)
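
For readers unfamiliar with this output: pdfminer writes `(cid:N)` when it cannot map a glyph code to a Unicode character, so a file made up of `(cid:0)` tokens suggests that a text layer exists but its font mapping is unusable. Below is a minimal sketch for measuring this on extracted text. The function name `unmapped_ratio` is illustrative and is not part of pdfminer or hocr-tools.

```python
import re

# pdfminer's placeholder for a glyph it could not map to Unicode.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def unmapped_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that sit inside
    (cid:N) placeholders; 1.0 means nothing readable was extracted."""
    visible = re.sub(r"\s+", "", text)
    if not visible:
        return 0.0
    placeholder_chars = sum(len(token) for token in CID_TOKEN.findall(visible))
    return placeholder_chars / len(visible)
```

A ratio close to 1.0, as in the excerpt above, points to a font or encoding problem in the PDF rather than to missing OCR text.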

My hocr file is generated by kraken.

The kraken documentation says:

hOCR output is slightly different from hOCR files produced by ocropus. Each ocr_line span contains not only the bounding box of the line but also character boxes (x_bboxes attribute) indicating the coordinates of each character. In each line alternating sequences of alphanumeric and non-alphanumeric (in the unicode sense) characters are put into ocrx_word spans. Both have bounding boxes as attributes and the recognition confidence for each character in the x_conf attribute.

Paragraph detection has been removed as it was deemed to be unduly dependent on certain typographic features which may not be valid for your input.
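
To make the structure described in that quote concrete, here is a small sketch that pulls the text and bounding box out of each `ocrx_word` span using only the standard library. The `SAMPLE` fragment is hypothetical and heavily simplified (real kraken output carries more attributes, such as `x_bboxes` and `x_conf`), and `WordCollector` and `collect_words` are illustrative names, not part of hocr-tools.

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified hOCR fragment for illustration only.
SAMPLE = (
    '<span class="ocr_line" title="bbox 10 20 300 60">'
    '<span class="ocrx_word" title="bbox 10 20 120 60">formé</span>'
    '<span class="ocrx_word" title="bbox 130 20 300 60">texte</span>'
    '</span>'
)

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the ocrx_word span currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX.search(attrs.get("title") or "")
            self._bbox = tuple(int(n) for n in match.groups()) if match else None

    def handle_data(self, data):
        # Only keep text that appears inside a word span with a bbox.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def collect_words(hocr: str):
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Running `collect_words` on the actual kraken file is a quick way to confirm that the words and boxes are present before suspecting the PDF step.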

So I also tried with an ALTO file (also generated by kraken), which I converted to hOCR format with ocr-fileformat. Same result.

stefan6419846 commented 4 months ago

Which version of reportlab are you using? As far as I am aware, reportlab>=4.1.0 breaks hocr-pdf.
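
A quick way to check an environment against the constraint mentioned in this comment is sketched below. The 4.1.0 threshold is taken from the comment itself; the helper names `version_tuple` and `reportlab_is_compatible` are illustrative, and the version parsing is deliberately simple (numeric components only).

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.2.5' into (4, 2, 5), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def reportlab_is_compatible(version: str) -> bool:
    """True if the version is below 4.1.0, the threshold reported in this thread."""
    return version_tuple(version) < (4, 1, 0)


if __name__ == "__main__":
    try:
        installed = metadata.version("reportlab")
    except metadata.PackageNotFoundError:
        print("reportlab is not installed in this environment")
    else:
        status = "ok" if reportlab_is_compatible(installed) else "too new for hocr-pdf"
        print(f"reportlab {installed}: {status}")
```

The script has to be run with the same interpreter that runs hocr-pdf, otherwise it reports on a different environment.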

pprw commented 4 months ago

Thanks for the information.

I was using reportlab 4.2.2. I downgraded to 4.0.9.

Now I no longer get the WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data message,

but I still cannot search inside the PDF, and pdf2txt creates a file filled with:

[image: screenshot of the pdf2txt output]

misters2008 commented 4 months ago

pprw, I am having the same issue with these symbols instead of normal text. Have you been able to fix it?

stefan6419846 commented 4 months ago

Does it work with pdftotext file.pdf -? At least during my testing, this would generate a PDF file with a valid text layer when using the hocr-tools master branch (due to unfixed issues in the release on Python 3.10) and using reportlab==4.0.9.

pprw commented 1 month ago

Sorry for the late reply.

pdftotext file.pdf - does not display anything.

I installed reportlab 4.0.9 and the master version of hocr-tools:

pipx install reportlab==4.0.9 --include-deps --force
pipx install git+https://github.com/ocropus/hocr-tools.git@master --force

I commented out lines 30 and 116 of the hocr-pdf file because of an error about the bidi library.

line 30: from bidi.algorithm import get_display           
line 116:  rawtext = get_display(rawtext)

I opened a specific issue about this. #188

So maybe it is related to this. I am trying to fix the bidi error and will see after that if there is any change.
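
As an alternative to commenting the two lines out, a guarded import keeps the script running when `bidi.algorithm` cannot be imported. This is only a sketch of a local workaround, not the fix adopted upstream: the fallback returns the text unchanged, so right-to-left text would not be reordered.

```python
try:
    # Same import as line 30 of hocr-pdf quoted above.
    from bidi.algorithm import get_display
except ImportError:
    def get_display(text):
        """Fallback used when python-bidi is unavailable: no reordering."""
        return text


# Same call as line 116 of hocr-pdf; works with or without python-bidi.
rawtext = get_display("plain left-to-right text")
```

For left-to-right text such as French, the fallback and the real function are expected to give the same result.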

stefan6419846 commented 1 month ago

This is most likely the same issue as in https://github.com/ocropus/hocr-tools/issues/188#issuecomment-2402585611, i.e. you are not using pipx correctly. hocr-tools currently does not pin reportlab to a compatible version, thus

pipx install git+https://github.com/ocropus/hocr-tools.git@master --force

should indicate that you are indeed installing/using the latest reportlab version for hocr-tools and not version 4.0.9.
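
Some background on why the two pipx commands do not combine: pipx gives each application its own virtual environment, so a reportlab installed with a separate `pipx install` is not the one hocr-pdf imports. One way to check which environment is in use is sketched below, under the assumption that hocr-pdf was installed as a script whose first line is a shebang pointing at its environment's interpreter. `script_interpreter` is an illustrative helper, not part of pipx or hocr-tools.

```python
import shutil
from typing import Optional


def script_interpreter(path: str) -> Optional[str]:
    """Return the interpreter named in the script's shebang line, if any."""
    with open(path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


if __name__ == "__main__":
    script = shutil.which("hocr-pdf")
    if script is None:
        print("hocr-pdf not found on PATH")
    else:
        print("hocr-pdf runs under:", script_interpreter(script))
```

Running the reported interpreter with `-c "import reportlab; print(reportlab.Version)"` then shows the reportlab version that hocr-pdf actually uses.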

pprw commented 1 month ago

Thanks for the comment.

I reinstalled hocr-tools without using pipx and in the same environment

$ python3 -m venv $HOME/.venvs/hocr
$ source $HOME/.venvs/hocr/bin/activate
$ pip install hocr-tools
Collecting hocr-tools
  Using cached hocr_tools-1.1.1-py3-none-any.whl
Collecting Pillow
  Downloading pillow-10.4.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 28.2 MB/s eta 0:00:00
Collecting lxml
  Using cached lxml-5.3.0-cp311-cp311-manylinux_2_28_x86_64.whl (5.0 MB)
Collecting reportlab
  Using cached reportlab-4.2.5-py3-none-any.whl (1.9 MB)
Collecting chardet
  Using cached chardet-5.2.0-py3-none-any.whl (199 kB)
Installing collected packages: Pillow, lxml, chardet, reportlab, hocr-tools
Successfully installed Pillow-10.4.0 chardet-5.2.0 hocr-tools-1.1.1 lxml-5.3.0 reportlab-4.2.5

hocr-pdf . > output.pdf generates no error, but the file is still not readable:

$ pdftotext output.pdf -
Syntax Error (2441217): Illegal character <24> in hex string
Syntax Error (2441218): Illegal character <22> in hex string
Syntax Error (2441220): Illegal character <47> in hex string
Syntax Error (2441221): Illegal character <69> in hex string
Syntax Error (2441222): Illegal character <68> in hex string
Syntax Error (2441224): Illegal character <6b> in hex string
Syntax Error (2441225): Illegal character <5c> in hex string
Syntax Error (2441226): Illegal character <4b> in hex string
Syntax Error (2441227): Illegal character <3f> in hex string
Syntax Error (2441229): Illegal character <71> in hex string
Syntax Error (2441231): Illegal character <56> in hex string
Syntax Error (2441232): Illegal character <27> in hex string
Syntax Error (2441233): Illegal character <40> in hex string
Syntax Error (2441234): Illegal character <4d> in hex string
Syntax Error (2441236): Illegal character <2c> in hex string
Syntax Error (2441237): Illegal character <2e> in hex string
Syntax Error (2441238): Illegal character <51> in hex string
Syntax Error (2441240): Illegal character <5f> in hex string
Syntax Error (2441241): Illegal character <24> in hex string
Syntax Error (2441242): Illegal character <58> in hex string
Syntax Error (2441243): Illegal character <3a> in hex string
Syntax Error (2441244): Illegal character <3f> in hex string
Syntax Error (2441245): Illegal character <6b> in hex string
Syntax Error (2441246): Illegal character <23> in hex string
Syntax Error (2441247): Illegal character <2f> in hex string
Syntax Error (2441248): Illegal character <6d> in hex string
Syntax Error (2441249): Illegal character <73> in hex string
Syntax Error (2441250): Illegal character <6d> in hex string
Syntax Error (2441251): Illegal character <6d> in hex string
Syntax Error (2441252): Illegal character <51> in hex string
Syntax Error (2441253): Illegal character <2f> in hex string
Syntax Error (2441255): Illegal character <54> in hex string
Syntax Error (2441256): Illegal character <24> in hex string
Syntax Error (2441257): Illegal character <48> in hex string
Syntax Error (2441261): Illegal character <5b> in hex string
Syntax Error (2441262): Illegal character <70> in hex string
Syntax Error (2441263): Illegal character <2f> in hex string
Syntax Error (2441264): Illegal character <68> in hex string
Syntax Error (2441265): Illegal character <71> in hex string
Syntax Error (2441266): Illegal character <59> in hex string
Syntax Error (2441267): Illegal character <2c> in hex string
Syntax Error (2441268): Illegal character <3c> in hex string
Syntax Error (2441269): Illegal character <5f> in hex string
Syntax Error (2441270): Illegal character <57> in hex string
Syntax Error (2441273): Illegal character <50> in hex string
Syntax Error (2441275): Illegal character <69> in hex string
Syntax Error (2441276): Illegal character <40> in hex string
Syntax Error (2441278): Illegal character <4c> in hex string
Syntax Error (2441280): Illegal character <70> in hex string
Syntax Error (2441281): Illegal character <5d> in hex string
Syntax Error (2441282): Illegal character <4a> in hex string
Syntax Error (2441283): Illegal character <23> in hex string
Syntax Error (2441284): Illegal character <59> in hex string
Syntax Error (2441285): Illegal character <56> in hex string
Syntax Error (2441287): Illegal character <71> in hex string
Syntax Error (2441288): Illegal character <5e> in hex string
Syntax Error (2441290): Illegal character <4c> in hex string
Syntax Error (2441291): Illegal character <28> in hex string
Syntax Error (2441292): Illegal character <24> in hex string
Syntax Error (2441293): Illegal character <2f> in hex string
Syntax Error (2441294): Illegal character <55> in hex string
Syntax Error (2441217): Illegal character <24> in hex string
Syntax Error (2441218): Illegal character <22> in hex string
Syntax Error (2441220): Illegal character <47> in hex string
Syntax Error (2441221): Illegal character <69> in hex string
Syntax Error (2441222): Illegal character <68> in hex string
Syntax Error (2441224): Illegal character <6b> in hex string
Syntax Error (2441225): Illegal character <5c> in hex string
Syntax Error (2441226): Illegal character <4b> in hex string
Syntax Error (2441227): Illegal character <3f> in hex string
Syntax Error (2441229): Illegal character <71> in hex string
Syntax Error (2441231): Illegal character <56> in hex string
Syntax Error (2441232): Illegal character <27> in hex string
Syntax Error (2441233): Illegal character <40> in hex string
Syntax Error (2441234): Illegal character <4d> in hex string
Syntax Error (2441236): Illegal character <2c> in hex string
Syntax Error (2441237): Illegal character <2e> in hex string
Syntax Error (2441238): Illegal character <51> in hex string
Syntax Error (2441240): Illegal character <5f> in hex string
Syntax Error (2441241): Illegal character <24> in hex string
Syntax Error (2441242): Illegal character <58> in hex string
Syntax Error (2441243): Illegal character <3a> in hex string
Syntax Error (2441244): Illegal character <3f> in hex string
Syntax Error (2441245): Illegal character <6b> in hex string
Syntax Error (2441246): Illegal character <23> in hex string
Syntax Error (2441247): Illegal character <2f> in hex string
Syntax Error (2441248): Illegal character <6d> in hex string
Syntax Error (2441249): Illegal character <73> in hex string
Syntax Error (2441250): Illegal character <6d> in hex string
Syntax Error (2441251): Illegal character <6d> in hex string
Syntax Error (2441252): Illegal character <51> in hex string
Syntax Error (2441253): Illegal character <2f> in hex string
Syntax Error (2441255): Illegal character <54> in hex string
Syntax Error (2441256): Illegal character <24> in hex string
Syntax Error (2441257): Illegal character <48> in hex string
Syntax Error (2441261): Illegal character <5b> in hex string
Syntax Error (2441262): Illegal character <70> in hex string
Syntax Error (2441263): Illegal character <2f> in hex string
Syntax Error (2441264): Illegal character <68> in hex string
Syntax Error (2441265): Illegal character <71> in hex string
Syntax Error (2441266): Illegal character <59> in hex string
Syntax Error (2441267): Illegal character <2c> in hex string
Syntax Error (2441268): Illegal character <3c> in hex string
Syntax Error (2441269): Illegal character <5f> in hex string
Syntax Error (2441270): Illegal character <57> in hex string
Syntax Error (2441273): Illegal character <50> in hex string
Syntax Error (2441275): Illegal character <69> in hex string
Syntax Error (2441276): Illegal character <40> in hex string
Syntax Error (2441278): Illegal character <4c> in hex string
Syntax Error (2441280): Illegal character <70> in hex string
Syntax Error (2441281): Illegal character <5d> in hex string
Syntax Error (2441282): Illegal character <4a> in hex string
Syntax Error (2441283): Illegal character <23> in hex string
Syntax Error (2441284): Illegal character <59> in hex string
Syntax Error (2441285): Illegal character <56> in hex string
Syntax Error (2441287): Illegal character <71> in hex string
Syntax Error (2441288): Illegal character <5e> in hex string
Syntax Error (2441290): Illegal character <4c> in hex string
Syntax Error (2441291): Illegal character <28> in hex string
Syntax Error (2441292): Illegal character <24> in hex string
Syntax Error (2441293): Illegal character <2f> in hex string
Syntax Error (2441294): Illegal character <55> in hex string
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Invalid XRef entry 0
Syntax Error (2437693): Missing 'endstream' or incorrect stream length
Syntax Error (2436161): Bad FCHECK in flate stream
Syntax Error: Embedded font file may be invalid
Syntax Error (2436088): Missing 'endstream' or incorrect stream length
Syntax Error (2435010): Bad FCHECK in flate stream

stefan6419846 commented 1 month ago

Because you are using reportlab==4.2.5. Please force reportlab==4.0.9.

pprw commented 1 month ago

Sorry, I noticed that just after commenting.

After running pip install reportlab==4.0.9 --force

I have a PDF with a readable text layer.

pdf2txt still complains about corrupted data:

$ pdf2txt output.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

pdftotext output.pdf -

displays the text

Evince (PDF reader) still complains a lot with "some font thing failed" when reading the PDF, but search works.

stefan6419846 commented 1 month ago

I have not validated other tools further, but you might want to have a look at https://github.com/stefan6419846/hocr-tools which fixes both the compatibility with recent reportlab versions and includes #178 which might fix some of these aspects.

pprw commented 1 month ago

I think my problem is related to accent support. The recognized text is in French, and I cannot search for accented letters in the output PDF created by hocr-pdf.

for example: "formé" in the hocr file is "formeÄ" in the output pdf

I will try your fork
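
One possible explanation for the accent problem, which is an assumption and not confirmed anywhere in this thread, is that the hOCR text stores accented letters in decomposed form (a base letter followed by a combining accent), which some PDF fonts and readers handle poorly. The sketch below shows how to inspect a string and normalize it to the composed (NFC) form with the standard library; the helper names are illustrative.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


def describe(text: str) -> list:
    """List the code points of a string, to see how accents are stored."""
    return [f"U+{ord(ch):04X}" for ch in text]


# "formé" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "forme\u0301"
composed = to_nfc(decomposed)

print(describe(decomposed))
print(describe(composed))
```

If `describe` shows combining marks such as U+0301 in words taken from the hOCR file, normalizing the text before generating the PDF may be worth testing.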