Open pprw opened 4 months ago
Which version of reportlab are you using? As far as I am aware, reportlab>=4.1.0 breaks hocr-pdf.
Thanks for the information.
I was using reportlab 4.2.2. I downgraded to 4.0.9.
Now I no longer get the warning
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data
but I cannot search inside the PDF, and pdf2txt creates a file filled with:
pprw, I am having the same issue with these symbols appearing instead of normal text. Have you been able to fix it in the meantime?
Does pdftotext file.pdf - work? In my testing, hocr-pdf generated a PDF file with a valid text layer when using the hocr-tools master branch (needed because of unfixed issues in the release on Python 3.10) together with reportlab==4.0.9.
Sorry for the late reply.
pdftotext file.pdf - does not display anything.
I installed reportlab 4.0.9 and the master version of hocr-tools:
pipx install reportlab==4.0.9 --include-deps --force
pipx install git+https://github.com/ocropus/hocr-tools.git@master --force
I commented out lines 30 and 116 of the hocr-pdf file because of an error about the bidi library:
line 30: from bidi.algorithm import get_display
line 116: rawtext = get_display(rawtext)
I opened a separate issue about this: #188
So maybe it is related to this. I am trying to fix the bidi error and will then see whether anything changes.
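For anyone hitting the same bidi error, a guarded import may be a gentler workaround than commenting out both lines. This is only a sketch: it assumes the error is an ImportError raised by the import on line 30, which this thread does not confirm. The fallback returns the text unchanged, so right-to-left text would not be reordered.

```python
# Sketch only: assumes the bidi problem surfaces as an ImportError.
try:
    from bidi.algorithm import get_display
except ImportError:
    def get_display(text):
        """Fallback: return the text unchanged (no right-to-left reordering)."""
        return text

# Mirrors the call on line 116 of hocr-pdf quoted above.
rawtext = "abc"
rawtext = get_display(rawtext)
print(rawtext)
```

If the actual error is something other than an ImportError, this guard will not help and the details in #188 apply instead.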
This is most likely the same issue as in https://github.com/ocropus/hocr-tools/issues/188#issuecomment-2402585611, i.e. you are not using pipx correctly. pipx installs each package into its own isolated environment, so a separately installed reportlab 4.0.9 is not the one hocr-tools uses. hocr-tools currently does not pin reportlab to a compatible version, so the output of
pipx install git+https://github.com/ocropus/hocr-tools.git@master --force
should show that you are in fact installing and using the latest reportlab version for hocr-tools, not version 4.0.9.
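To see which reportlab version a given environment actually resolves, a small check like the following can be run with that environment's Python interpreter. It is a minimal sketch: the 4.1.0 cut-off is taken from the comment earlier in this thread, and the helper names (parse_release, is_compatible, installed_reportlab) are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Assumption from this thread: reportlab 4.1.0 and later break hocr-pdf.
FIRST_BROKEN = (4, 1, 0)


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Turn '4.0.9' into (4, 0, 9), ignoring non-numeric suffixes."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '4.1' so comparisons use three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_compatible(version_string: str) -> bool:
    """True if the version is older than the first release reported as broken."""
    return parse_release(version_string) < FIRST_BROKEN


def installed_reportlab() -> Optional[str]:
    """Return the reportlab version visible to this interpreter, if any."""
    try:
        return version("reportlab")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_reportlab()
    if found is None:
        print("reportlab is not installed in this environment")
    elif is_compatible(found):
        print(f"reportlab {found}: older than 4.1.0")
    else:
        print(f"reportlab {found}: 4.1.0 or newer, reported to break hocr-pdf")
```

The check has to be run inside the same environment that provides hocr-pdf; running it with a different interpreter reports that interpreter's packages instead.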
Thanks for the comment.
I reinstalled hocr-tools without pipx, with everything in the same environment:
$ python3 -m venv $HOME/.venvs/hocr
$ source $HOME/.venvs/hocr/bin/activate
$ pip install hocr-tools
Collecting hocr-tools
Using cached hocr_tools-1.1.1-py3-none-any.whl
Collecting Pillow
Downloading pillow-10.4.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 28.2 MB/s eta 0:00:00
Collecting lxml
Using cached lxml-5.3.0-cp311-cp311-manylinux_2_28_x86_64.whl (5.0 MB)
Collecting reportlab
Using cached reportlab-4.2.5-py3-none-any.whl (1.9 MB)
Collecting chardet
Using cached chardet-5.2.0-py3-none-any.whl (199 kB)
Installing collected packages: Pillow, lxml, chardet, reportlab, hocr-tools
Successfully installed Pillow-10.4.0 chardet-5.2.0 hocr-tools-1.1.1 lxml-5.3.0 reportlab-4.2.5
hocr-pdf . > output.pdf generates no error, but the file is still not readable:
$ pdftotext output.pdf -
Syntax Error (2441217): Illegal character <24> in hex string
Syntax Error (2441218): Illegal character <22> in hex string
Syntax Error (2441220): Illegal character <47> in hex string
Syntax Error (2441221): Illegal character <69> in hex string
Syntax Error (2441222): Illegal character <68> in hex string
Syntax Error (2441224): Illegal character <6b> in hex string
Syntax Error (2441225): Illegal character <5c> in hex string
Syntax Error (2441226): Illegal character <4b> in hex string
Syntax Error (2441227): Illegal character <3f> in hex string
Syntax Error (2441229): Illegal character <71> in hex string
Syntax Error (2441231): Illegal character <56> in hex string
Syntax Error (2441232): Illegal character <27> in hex string
Syntax Error (2441233): Illegal character <40> in hex string
Syntax Error (2441234): Illegal character <4d> in hex string
Syntax Error (2441236): Illegal character <2c> in hex string
Syntax Error (2441237): Illegal character <2e> in hex string
Syntax Error (2441238): Illegal character <51> in hex string
Syntax Error (2441240): Illegal character <5f> in hex string
Syntax Error (2441241): Illegal character <24> in hex string
Syntax Error (2441242): Illegal character <58> in hex string
Syntax Error (2441243): Illegal character <3a> in hex string
Syntax Error (2441244): Illegal character <3f> in hex string
Syntax Error (2441245): Illegal character <6b> in hex string
Syntax Error (2441246): Illegal character <23> in hex string
Syntax Error (2441247): Illegal character <2f> in hex string
Syntax Error (2441248): Illegal character <6d> in hex string
Syntax Error (2441249): Illegal character <73> in hex string
Syntax Error (2441250): Illegal character <6d> in hex string
Syntax Error (2441251): Illegal character <6d> in hex string
Syntax Error (2441252): Illegal character <51> in hex string
Syntax Error (2441253): Illegal character <2f> in hex string
Syntax Error (2441255): Illegal character <54> in hex string
Syntax Error (2441256): Illegal character <24> in hex string
Syntax Error (2441257): Illegal character <48> in hex string
Syntax Error (2441261): Illegal character <5b> in hex string
Syntax Error (2441262): Illegal character <70> in hex string
Syntax Error (2441263): Illegal character <2f> in hex string
Syntax Error (2441264): Illegal character <68> in hex string
Syntax Error (2441265): Illegal character <71> in hex string
Syntax Error (2441266): Illegal character <59> in hex string
Syntax Error (2441267): Illegal character <2c> in hex string
Syntax Error (2441268): Illegal character <3c> in hex string
Syntax Error (2441269): Illegal character <5f> in hex string
Syntax Error (2441270): Illegal character <57> in hex string
Syntax Error (2441273): Illegal character <50> in hex string
Syntax Error (2441275): Illegal character <69> in hex string
Syntax Error (2441276): Illegal character <40> in hex string
Syntax Error (2441278): Illegal character <4c> in hex string
Syntax Error (2441280): Illegal character <70> in hex string
Syntax Error (2441281): Illegal character <5d> in hex string
Syntax Error (2441282): Illegal character <4a> in hex string
Syntax Error (2441283): Illegal character <23> in hex string
Syntax Error (2441284): Illegal character <59> in hex string
Syntax Error (2441285): Illegal character <56> in hex string
Syntax Error (2441287): Illegal character <71> in hex string
Syntax Error (2441288): Illegal character <5e> in hex string
Syntax Error (2441290): Illegal character <4c> in hex string
Syntax Error (2441291): Illegal character <28> in hex string
Syntax Error (2441292): Illegal character <24> in hex string
Syntax Error (2441293): Illegal character <2f> in hex string
Syntax Error (2441294): Illegal character <55> in hex string
Syntax Error (2441217): Illegal character <24> in hex string
Syntax Error (2441218): Illegal character <22> in hex string
Syntax Error (2441220): Illegal character <47> in hex string
Syntax Error (2441221): Illegal character <69> in hex string
Syntax Error (2441222): Illegal character <68> in hex string
Syntax Error (2441224): Illegal character <6b> in hex string
Syntax Error (2441225): Illegal character <5c> in hex string
Syntax Error (2441226): Illegal character <4b> in hex string
Syntax Error (2441227): Illegal character <3f> in hex string
Syntax Error (2441229): Illegal character <71> in hex string
Syntax Error (2441231): Illegal character <56> in hex string
Syntax Error (2441232): Illegal character <27> in hex string
Syntax Error (2441233): Illegal character <40> in hex string
Syntax Error (2441234): Illegal character <4d> in hex string
Syntax Error (2441236): Illegal character <2c> in hex string
Syntax Error (2441237): Illegal character <2e> in hex string
Syntax Error (2441238): Illegal character <51> in hex string
Syntax Error (2441240): Illegal character <5f> in hex string
Syntax Error (2441241): Illegal character <24> in hex string
Syntax Error (2441242): Illegal character <58> in hex string
Syntax Error (2441243): Illegal character <3a> in hex string
Syntax Error (2441244): Illegal character <3f> in hex string
Syntax Error (2441245): Illegal character <6b> in hex string
Syntax Error (2441246): Illegal character <23> in hex string
Syntax Error (2441247): Illegal character <2f> in hex string
Syntax Error (2441248): Illegal character <6d> in hex string
Syntax Error (2441249): Illegal character <73> in hex string
Syntax Error (2441250): Illegal character <6d> in hex string
Syntax Error (2441251): Illegal character <6d> in hex string
Syntax Error (2441252): Illegal character <51> in hex string
Syntax Error (2441253): Illegal character <2f> in hex string
Syntax Error (2441255): Illegal character <54> in hex string
Syntax Error (2441256): Illegal character <24> in hex string
Syntax Error (2441257): Illegal character <48> in hex string
Syntax Error (2441261): Illegal character <5b> in hex string
Syntax Error (2441262): Illegal character <70> in hex string
Syntax Error (2441263): Illegal character <2f> in hex string
Syntax Error (2441264): Illegal character <68> in hex string
Syntax Error (2441265): Illegal character <71> in hex string
Syntax Error (2441266): Illegal character <59> in hex string
Syntax Error (2441267): Illegal character <2c> in hex string
Syntax Error (2441268): Illegal character <3c> in hex string
Syntax Error (2441269): Illegal character <5f> in hex string
Syntax Error (2441270): Illegal character <57> in hex string
Syntax Error (2441273): Illegal character <50> in hex string
Syntax Error (2441275): Illegal character <69> in hex string
Syntax Error (2441276): Illegal character <40> in hex string
Syntax Error (2441278): Illegal character <4c> in hex string
Syntax Error (2441280): Illegal character <70> in hex string
Syntax Error (2441281): Illegal character <5d> in hex string
Syntax Error (2441282): Illegal character <4a> in hex string
Syntax Error (2441283): Illegal character <23> in hex string
Syntax Error (2441284): Illegal character <59> in hex string
Syntax Error (2441285): Illegal character <56> in hex string
Syntax Error (2441287): Illegal character <71> in hex string
Syntax Error (2441288): Illegal character <5e> in hex string
Syntax Error (2441290): Illegal character <4c> in hex string
Syntax Error (2441291): Illegal character <28> in hex string
Syntax Error (2441292): Illegal character <24> in hex string
Syntax Error (2441293): Illegal character <2f> in hex string
Syntax Error (2441294): Illegal character <55> in hex string
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Invalid XRef entry 0
Syntax Error (2437693): Missing 'endstream' or incorrect stream length
Syntax Error (2436161): Bad FCHECK in flate stream
Syntax Error: Embedded font file may be invalid
Syntax Error (2436088): Missing 'endstream' or incorrect stream length
Syntax Error (2435010): Bad FCHECK in flate stream
Because you are using reportlab==4.2.5. Please force reportlab==4.0.9.
Sorry, I noticed that just after commenting.
With
pip install reportlab==4.0.9 --force
I get a PDF with a readable text layer.
pdf2txt still complains about corrupted data:
$ pdf2txt output.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data
pdftotext output.pdf - displays the text.
Evince (PDF reader) complains a lot with "some font thing failed" when reading the PDF, but search works.
I have not validated other tools further, but you might want to have a look at https://github.com/stefan6419846/hocr-tools, which fixes the compatibility with recent reportlab versions and includes #178, which might fix some of these aspects.
I think my problem is related to accent support. The recognized text is in French and I cannot search for accented letters in the output PDF created by hocr-pdf.
For example, "formé" in the hocr file becomes "formeÄ" in the output PDF.
I will try your fork.
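One thing that may be worth checking here: in "formeÄ" the plain "e" survives and a stray character follows it, which would be consistent with the hocr text storing "é" in decomposed form (an "e" followed by a combining accent). That is an assumption and not confirmed anywhere in this thread. If it holds, normalizing the text to NFC before generating the PDF could be tried, roughly like this (to_nfc is a made-up helper name):

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


# "forme" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "forme\u0301"
composed = to_nfc(decomposed)

print(len(decomposed))  # 6 code points: f, o, r, m, e, combining accent
print(len(composed))    # 5 code points: f, o, r, m, é
print(composed == "form\u00e9")
```

If the hocr file already contains composed characters, normalization changes nothing and the cause lies elsewhere, for example in the font handling.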
I am trying to generate a searchable PDF from a JPEG file and an hocr file using hocr-pdf.
I have both files in the same folder.
hocr-pdf . > out.pdf
generates a PDF, but I cannot search inside it. The PDF reader (Evince) says "some font thing failed" when displaying the file (I can see the image).
When I extract the text from the pdf
and out.txt contains (excerpt)
My hocr file is generated by kraken.
I read from kraken documentation
So I also tried with an ALTO file (also generated by Kraken), which I converted to hocr format using ocr-fileformat. Same result.