Closed rmast closed 2 years ago
I don't immediately see something wrong with the hOCR file, but I'll take another look. What Tesseract version did you use?
A few things I should mention, in case they're relevant:
The hocr-combine-stream tool can combine hOCR files: https://archive-hocr-tools.readthedocs.io/en/latest/#hocr-combine-stream
--from-imagestack accepts a glob as argument (just single quote it, like so: --from-imagestack 'imgs*.tif'). This way you can create a PDF with multiple pages in one go.
Could it be that the path to the image is perhaps not correct, leading the glob() call to return zero files?
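One quick way to test that hypothesis is to expand the pattern by hand before handing it to recode_pdf. The following is only a sketch: it assumes the pattern is expanded with Python's standard glob module (the thread only mentions a glob() call), and expand_imagestack is a made-up helper name, not part of archive-pdf-tools.

```python
import glob


def expand_imagestack(pattern):
    """Return the sorted files matching a --from-imagestack style pattern.

    Hypothetical helper: it only mirrors what a plain glob() call would see.
    """
    matches = sorted(glob.glob(pattern))
    if not matches:
        # An empty result means the path or pattern is wrong.
        raise FileNotFoundError(f"glob matched no files: {pattern!r}")
    return matches
```

Calling expand_imagestack('imgs*.tif') from the directory where recode_pdf is run should list the page images; a FileNotFoundError points at the path rather than at the hOCR file.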
If I use the 6 images in the original scanned PDF and run this command: recode_pdf -P 'Documenten/210923 nog 3 zonder haartje.pdf' --hocr-file 6x4.hocr -o 210923.pdf
Only the last page (the page with the strange HOCR, coming from tesseract 5.0.0-beta-20210916-69-g81e9e) gives this error:
Skipping word with low confidence.
Skipping word with low confidence.
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in
I think I might see what is going on:
<span class='ocrx_word' id='word_1_44' title='bbox 1794 418 1972 456; x_wconf 16'><em>Grondslagrr</em></span>
Contains <em> - I haven't seen that before in hOCR files I think. I might need to fix this in archive-hocr-tools. Did you use any special Tesseract runtime options?
Tesseract definitely can insert it: https://github.com/tesseract-ocr/tesseract/blob/main/src/api/hocrrenderer.cpp#L297
but I am not sure which options cause Tesseract to detect the font bold and italic properties. I usually set this to 1, but I haven't seen the <em> property at all on a few million books.
hocr_font_info 0 Add font info to hocr output
Maybe it's because I also specify hocr_char_boxes=1.
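To illustrate why a nested <em> can trip up a parser, here is a small sketch using the word span quoted above. It is not the archive-hocr-tools code; it uses only the standard library's ElementTree, and word_info is a hypothetical helper. The point is that the element's direct text is empty once the word is wrapped in <em>, while itertext() still recovers it.

```python
import xml.etree.ElementTree as ET

# The word span quoted in this thread.
WORD = ("<span class='ocrx_word' id='word_1_44' "
        "title='bbox 1794 418 1972 456; x_wconf 16'>"
        "<em>Grondslagrr</em></span>")


def word_info(span_xml):
    """Parse one ocrx_word span (hypothetical helper, for illustration)."""
    elem = ET.fromstring(span_xml)
    # The title attribute holds ';'-separated "key value..." properties.
    props = {}
    for part in elem.attrib["title"].split(";"):
        key, _, value = part.strip().partition(" ")
        props[key] = value
    return {
        # None here, because the text lives inside the nested <em>.
        "direct_text": elem.text,
        # itertext() walks child elements too, so the word is recovered.
        "text": "".join(elem.itertext()),
        "bbox": [int(v) for v in props["bbox"].split()],
        "conf": int(props["x_wconf"]),
    }
```

Code that reads only the direct text of the span would get nothing for this word, which is consistent with the failure appearing only on the page that contains <em>.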
I just used your Tesseract-statement for this img.hocr. I even left out -l nld and it gave resolution 72 72 as well. I use the language files that are supporting legacy and new ocr.
Ok, thanks. It would be helpful if you can provide the full Tesseract command line, but I'm working on a fix for this problem now regardless.
tesseract --dpi 300 210923-005.tif - hocr> img5.hocr
I'm now compiling Tesseract 4.1.3 to see whether that also gives this element.
I think the reason this doesn't happen in my normal pipeline is indeed because I specify -c hocr_char_boxes=1 to Tesseract, which causes the hOCR parser to read per-character, which doesn't contain this word information (those are found with xpath). And my own tooling to transform a character-based file to a word-based one strips this information, so I never encounter it on archive.org.
tesseract --dpi 300 210923-005.tif - hocr> img5.hocr
I'm now compiling Tesseract 4.1.3 to see whether that also gives this element.
No need to do that, I'm pretty sure it will, I'll fix this today, thanks.
Meanwhile you could run Tesseract with -c hocr_char_boxes=1 to work around the problem, or manually remove the <em> and </em> tags.
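For the manual route, a sketch along these lines could strip the tags from an hOCR file while keeping the text inside them. It is a plain regular-expression pass over the file contents, not something taken from archive-hocr-tools; strip_em is a hypothetical name, and it only handles <em>, the one tag reported in this thread.

```python
import re

# Matches <em>, </em> and <em ...attributes...>, but not e.g. <embed>.
EM_TAG = re.compile(r"</?em\b[^>]*>")


def strip_em(hocr_text):
    """Remove <em> and </em> tags, keeping the enclosed text."""
    return EM_TAG.sub("", hocr_text)


def strip_em_file(src_path, dst_path):
    """Write a copy of the hOCR file at src_path without <em> tags."""
    with open(src_path, encoding="utf-8") as src:
        cleaned = strip_em(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)
```

For example, strip_em_file('img5.hocr', 'img5-noem.hocr') would produce a cleaned copy to pass to --hocr-file, leaving the original untouched.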
Can you try pip install archive-hocr-tools==1.1.12 - it should be fixed there.
I will update the dependency of archive-pdf-tools on archive-hocr-tools in the next few days.
Ok, I tried it. The second error now appears already with the first statement, so they appear to be different issues:
recode_pdf --from-imagestack 210923-005.tif --hocr-file img5.hocr -o outf.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
MMX
SSE
SSE2
SSE3
POPCNT
Creating text only PDF
Starting page generation at 2021-11-28T14:53:47.937466
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in
Adding -c hocr_char_boxes=1 does solve this second issue by the way.
Oh, I see what the problem is: I only fixed hocr_page_to_word_data_fast in the hocr package, and the pdf package uses hocr_page_to_word_data. I'll fix that as well momentarily.
Mind trying with pip install archive-hocr-tools==1.1.13?
No difference as far as I can see. Does it matter what A4 picture you use to interpret the hocr? Should I try to anonymize a picture and see if it still has this issue?
Hm, it works for me. Yes, please anonymise an image.
Documents.zip
recode_pdf --from-imagestack 210923-005a.tif --hocr-file img5a.hocr -o outfa.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
MMX
SSE
SSE2
SSE3
POPCNT
Creating text only PDF
Starting page generation at 2021-11-28T16:09:09.735919
Traceback (most recent call last):
File "/usr/local/bin/recode_pdf", line 4, in
I ran both pip3 install and sudo pip3 install for 1.1.13, and also changed the version in requirements.txt before running python3 setup.py build and install. That is the same procedure I followed for 1.1.12, where it did make a difference.
It works ok for me:
$ recode_pdf --from-imagestack /tmp/210923-005a.tif --hocr-file /tmp/img5a.hocr -o outfa.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
MMX
SSE
SSE2
SSE3
SSSE3
SSE41
POPCNT
SSE42
AVX
F16C
FMA3
AVX2
Creating text only PDF
Starting page generation at 2021-11-28T16:21:49.148181
Finished page generation at 2021-11-28T16:21:49.158080
Creating text pages took 0.0099 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 0, 'grey_conversion': 100, 'hocr_mask_gen': 17, 'est_1': 27, 'threshold': 100, 'fast_denoise': 5, 'mask_jbig2': 58, 'fg_partial_blur': 237, 'fg_jp2': 108, 'bg_partial_blur': 235, 'bg_downsample': 82, 'bg_jp2': 42, 'page_image_insertion': 0}
Saving PDF now
Processed 1 pages at 1.13 seconds/page
Compression ratio: 39.787262
It should work with 1.1.13 I think - is that what you are reporting now as well?
I will of course have to update the archive-pdf-tools dependency on the new hOCR one - that's why I suggested the one off pip install command earlier, I'll do that later today, I just want to make sure I'm not breaking anything else. :-)
It might be a dependency issue. I'm quite new at python.
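When pip3 install and sudo pip3 install are mixed, the package can end up in both the user and the system site-packages, so it helps to ask the interpreter which version it actually sees. This is a minimal sketch using the standard importlib.metadata module (Python 3.8 or later); installed_version is a hypothetical helper name.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Show which interpreter is answering, then the versions it can see.
    print("interpreter:", sys.executable)
    for name in ("archive-hocr-tools", "archive-pdf-tools"):
        print(name, installed_version(name))
```

Running it once with python3 and once with sudo python3 shows whether the two environments disagree about the archive-hocr-tools version.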
I have just uploaded version 1.4.10 - I think that should fix your problem. Please let me know if you run into other problems or have suggestions / feedback. Thanks!
I completely rebuilt a Mint Cinnamon 20.2 image to get it running. There is a small difference in the compression ratio reached. With the Gamera binarizer, this kind of behaviour signaled a memory leak; however, the compression ratio is constant with my setup:
recode_pdf --from-imagestack 210923-005a.tif --hocr-file img5a.hocr -o outfa.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
MMX
SSE
SSE2
SSE3
POPCNT
Creating text only PDF
Starting page generation at 2021-11-29T22:20:57.639133
Finished page generation at 2021-11-29T22:20:57.798189
Creating text pages took 0.1592 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 356, 'grey_conversion': 38, 'hocr_mask_gen': 69, 'est_1': 332, 'threshold': 421, 'fast_denoise': 13, 'mask_jbig2': 181, 'fg_partial_blur': 918, 'fg_jp2': 994, 'bg_partial_blur': 810, 'bg_downsample': 263, 'bg_jp2': 98, 'page_image_insertion': 4}
Saving PDF now
Processed 1 pages at 5.98 seconds/page
Compression ratio: 39.650476
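The two "MRC time breakdown" lines in this thread can be compared stage by stage. The sketch below just copies the two dictionaries from the logs; the unit of the numbers is not stated in the output, so the code treats them as opaque counts, and slowest_stages is a hypothetical helper.

```python
# Copied from the earlier successful run in this thread.
RUN_A = {'image_load': 0, 'grey_conversion': 100, 'hocr_mask_gen': 17,
         'est_1': 27, 'threshold': 100, 'fast_denoise': 5, 'mask_jbig2': 58,
         'fg_partial_blur': 237, 'fg_jp2': 108, 'bg_partial_blur': 235,
         'bg_downsample': 82, 'bg_jp2': 42, 'page_image_insertion': 0}

# Copied from the run on the rebuilt Mint image.
RUN_B = {'image_load': 356, 'grey_conversion': 38, 'hocr_mask_gen': 69,
         'est_1': 332, 'threshold': 421, 'fast_denoise': 13, 'mask_jbig2': 181,
         'fg_partial_blur': 918, 'fg_jp2': 994, 'bg_partial_blur': 810,
         'bg_downsample': 263, 'bg_jp2': 98, 'page_image_insertion': 4}


def slowest_stages(breakdown, n=3):
    """Return the n stages with the largest values, largest first."""
    ranked = sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for label, run in (("A", RUN_A), ("B", RUN_B)):
        print(label, "total:", sum(run.values()), slowest_stages(run))
```

In both runs the blur and JPEG2000 stages dominate, so the difference in seconds per page looks like a difference in machine speed rather than a change in which stage is expensive.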
Here is a condensed list of my install steps on the fresh image:
sudo apt install git
sudo apt-get install build-essential
sudo apt-get install autotools-dev
sudo apt-get install automake
sudo apt-get install libtool
sudo apt-get install libpng-dev
sudo apt install python3-pip
pip3 install Cython
pip3 install numpy
sudo pip3 install numpy
sudo pip3 install Cython
wget https://github.com/internetarchive/archive-pdf-tools/files/7613879/Documents.zip
unzip Documents.zip
wget https://kakadusoftware.com/wp-content/uploads/KDU805_Demo_Apps_for_Linux-x86-64_200602.zip
unzip KDU805_Demo_Apps_for_Linux-x86-64_200602.zip
sudo ln KDU805_Demo_Apps_for_Linux-x86-64_200602/kdu_expand /home/oem/.local/bin
sudo ln KDU805_Demo_Apps_for_Linux-x86-64_200602/kdu_compress /home/oem/.local/bin
sudo ln KDU805_Demo_Apps_for_Linux-x86-64_200602/libkdu_v80R.so /usr/local/lib
vi .bashrc
export PATH=$PATH:/home/oem/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git
cd leptonica/
git checkout 1.74.4
./autobuild
./configure
make
sudo make install
cd ..
git clone https://github.com/agl/jbig2enc.git
cd jbig2enc/
./autogen.sh
./configure
make
sudo make install
cd ..
git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
python3 setup.py build
sudo python3 setup.py install
cd ..
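After steps like these, a quick check that the external programs are reachable on PATH can save a confusing failure later. This is a sketch using the standard shutil.which; the tool list is an assumption based on the steps above (kdu_compress and kdu_expand from the Kakadu demo apps, recode_pdf from archive-pdf-tools, and jbig2 as the binary name I assume jbig2enc installs), and missing_tools is a hypothetical helper.

```python
import shutil

# Assumed tool names; adjust to match what the build actually installed.
TOOLS = ["kdu_compress", "kdu_expand", "jbig2", "recode_pdf"]


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("not on PATH:", ", ".join(missing))
    else:
        print("all tools found")
```

Note that the PATH export added to .bashrc only takes effect in a new shell, so the check is best run from a freshly opened terminal.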
When Tesseract generates this hOCR file (img.zip), I get this error:
For the first 5 pages there was no issue with the same command; it's only this page, so the hOCR coming from Tesseract contains something not allowed.