internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Error with hocr-files from Tesseract #30

Closed rmast closed 2 years ago

rmast commented 2 years ago

When Tesseract generates this HOCR-file img.zip

I get this error:

recode_pdf --from-imagestack ../210923-005.tif --hocr-file ~/img.hocr -o /tmp/outf.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
     MMX
     SSE
     SSE2
     SSE3
     SSSE3
     SSE41
     POPCNT
     SSE42
     AVX
     F16C
     XOP
     FMA4
     FMA3
Creating text only PDF
Starting page generation at 2021-11-28T10:59:56.133494
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.9', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 262, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 1070, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 189, in create_tess_textonly_pdf
    imgfile = image_files[idx]
IndexError: list index out of range

The same command worked without issue for the first 5 pages; it fails only on this page, so the hOCR coming from Tesseract apparently contains something that is not allowed.

MerlijnWajer commented 2 years ago

I don't immediately see something wrong with the hOCR file, but I'll take another look. What Tesseract version did you use?

A few things I should mention I think in case it's relevant:

This way you can create a PDF with multiple pages in one go.

MerlijnWajer commented 2 years ago

Could it be that the path to the image is perhaps not correct, leading the glob() call to return zero files?
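
To illustrate the suspicion above: when a glob pattern matches nothing, Python's glob.glob() returns an empty list, and indexing that list raises the same "IndexError: list index out of range" seen in the traceback. This is only a minimal sketch and not the actual recode.py code; the helper name pick_image and the file names are hypothetical.

```python
import glob
import os
import tempfile


def pick_image(pattern, idx):
    """Return the idx-th file matching pattern (hypothetical helper)."""
    image_files = sorted(glob.glob(pattern))
    if not image_files:
        # Fail early with a clearer message than a bare IndexError.
        raise FileNotFoundError(f"no files match {pattern!r}")
    return image_files[idx]


with tempfile.TemporaryDirectory() as tmp:
    # An existing file is found as expected.
    path = os.path.join(tmp, "page-000.tif")
    open(path, "wb").close()
    found = pick_image(os.path.join(tmp, "*.tif"), 0)

    # A wrong path yields an empty list; indexing it directly
    # raises "IndexError: list index out of range".
    missing = glob.glob(os.path.join(tmp, "does-not-exist", "*.tif"))
    try:
        missing[0]
        raised = False
    except IndexError:
        raised = True
```

Checking for an empty match list before indexing turns a confusing IndexError into a message that names the offending path.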

rmast commented 2 years ago

If I use the 6 images in the original scanned PDF and run this command: recode_pdf -P 'Documenten/210923 nog 3 zonder haartje.pdf' --hocr-file 6x4.hocr -o 210923.pdf

Only the last page (the page with the strange HOCR, coming from tesseract 5.0.0-beta-20210916-69-g81e9e) gives this error:

Skipping word with low confidence.
Skipping word with low confidence.
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.9', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 262, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 1070, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 290, in create_tess_textonly_pdf
    render.AddImageHandler(word_data, width, height, ppi=ppi, hocr_ppi=hocr_dpi)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/pdfrenderer.py", line 418, in AddImageHandler
    pdftext = self.GetPDFTextObjects(word_data, width, height, ppi, hocr_ppi=hocr_ppi)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/pdfrenderer.py", line 183, in GetPDFTextObjects
    for char in word['text']:
TypeError: 'NoneType' object is not iterable

MerlijnWajer commented 2 years ago

I think I might see what is going on:

      <span class='ocrx_word' id='word_1_44' title='bbox 1794 418 1972 456; x_wconf 16'><em>Grondslagrr</em></span>

Contains <em> - I haven't seen that before in hOCR files I think. I might need to fix this in archive-hocr-tools. Did you use any special Tesseract runtime options?
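
A minimal sketch of why a nested <em> could lead to the "'NoneType' object is not iterable" error: in ElementTree-style parsers, an element's .text attribute only holds the text that comes before its first child element, so it is None for the span above. This example uses the standard-library xml.etree.ElementTree purely for illustration; that the hOCR parser reads the word text this way is an assumption, not something confirmed from its source.

```python
import xml.etree.ElementTree as ET

# The word span from the hOCR file, with its text wrapped in <em>.
SPAN = ("<span class='ocrx_word' id='word_1_44' "
        "title='bbox 1794 418 1972 456; x_wconf 16'>"
        "<em>Grondslagrr</em></span>")

word = ET.fromstring(SPAN)

# .text only covers text before the first child element, so it is None here.
direct_text = word.text

# itertext() also walks into child elements and recovers the word.
full_text = "".join(word.itertext())

print(repr(direct_text), repr(full_text))
```

Iterating over direct_text would fail in the same way as the traceback, while the itertext() variant is indifferent to inline markup such as <em>.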

MerlijnWajer commented 2 years ago

Tesseract definitely can insert it: https://github.com/tesseract-ocr/tesseract/blob/main/src/api/hocrrenderer.cpp#L297

but I am not sure which options cause Tesseract to detect the bold and italic font properties. I usually set this to 1, but I haven't seen the <em> element at all in a few million books.

hocr_font_info  0   Add font info to hocr output

Maybe it's because I also specify hocr_char_boxes=1.

rmast commented 2 years ago

I just used your Tesseract command for this img.hocr. I even left out -l nld and it gave resolution 72 72 as well. I use the language files that support both the legacy and the new OCR engine.

MerlijnWajer commented 2 years ago

Ok, thanks. It would be helpful if you can provide the full Tesseract command line, but I'm working on a fix for this problem now regardless.

rmast commented 2 years ago

tesseract --dpi 300 210923-005.tif - hocr> img5.hocr

I'm now compiling Tesseract 4.1.3 to see whether that also gives this element.

MerlijnWajer commented 2 years ago

I think this doesn't happen in my normal pipeline because I pass -c hocr_char_boxes=1 to Tesseract. That makes the hOCR parser read the file per character, and the per-character data doesn't contain this word-level information (those are found with XPath). My own tooling for transforming a character-based file into a word-based one also strips this information, so I never encounter it on archive.org.

MerlijnWajer commented 2 years ago

> tesseract --dpi 300 210923-005.tif - hocr> img5.hocr
>
> I'm now compiling Tesseract 4.1.3 to see whether that also gives this element.

No need to do that, I'm pretty sure it will, I'll fix this today, thanks.

MerlijnWajer commented 2 years ago

Meanwhile you could run Tesseract with -c hocr_char_boxes=1 to work around the problem, or manually remove the <em> and </em> tags.
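
The second workaround (removing the tags by hand) could be scripted roughly as follows. This is a sketch only: the helper name strip_em_tags is hypothetical, and it handles nothing beyond plain <em> and </em> tags.

```python
import re

# Matches only the opening and closing <em> tags, not their contents.
EM_TAG = re.compile(r"</?em>")


def strip_em_tags(hocr_text):
    """Remove <em> and </em> tags while keeping the enclosed text."""
    return EM_TAG.sub("", hocr_text)


line = ("<span class='ocrx_word' id='word_1_44' "
        "title='bbox 1794 418 1972 456; x_wconf 16'>"
        "<em>Grondslagrr</em></span>")
cleaned = strip_em_tags(line)
print(cleaned)
```

To apply it to a whole file, read the hOCR file as text, pass the contents through strip_em_tags, and write the result to a new file that is then given to --hocr-file.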

MerlijnWajer commented 2 years ago

Can you try pip install archive-hocr-tools==1.1.12 - it should be fixed there.

MerlijnWajer commented 2 years ago

I will update the dependency of archive-pdf-tools on archive-hocr-tools in the next few days.

rmast commented 2 years ago

> Can you try pip install archive-hocr-tools==1.1.12 - it should be fixed there.

OK, I tried it. The second error now already appears with the first command, so these appear to be two different issues:

recode_pdf --from-imagestack 210923-005.tif --hocr-file img5.hocr -o outf.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
     MMX
     SSE
     SSE2
     SSE3
     POPCNT
Creating text only PDF
Starting page generation at 2021-11-28T14:53:47.937466
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.9', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 262, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 1070, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 290, in create_tess_textonly_pdf
    render.AddImageHandler(word_data, width, height, ppi=ppi, hocr_ppi=hocr_dpi)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/pdfrenderer.py", line 418, in AddImageHandler
    pdftext = self.GetPDFTextObjects(word_data, width, height, ppi, hocr_ppi=hocr_ppi)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/pdfrenderer.py", line 100, in GetPDFTextObjects
    for char in word['text']:
TypeError: 'NoneType' object is not iterable

rmast commented 2 years ago

Adding -c hocr_char_boxes=1 does solve this second issue by the way.
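
For anyone scripting this workaround, the command line from this thread with the extra option could be assembled as below. This is a sketch under assumptions: the helper name tesseract_hocr_command is hypothetical, and the option is placed after the output base and before the hocr config name, following Tesseract's documented argument order.

```python
import shlex


def tesseract_hocr_command(image, char_boxes=True, dpi=300):
    """Build the Tesseract command line used in this thread (hypothetical helper)."""
    # "-" as output base makes Tesseract write the result to stdout.
    cmd = ["tesseract", "--dpi", str(dpi), image, "-"]
    if char_boxes:
        # -c sets a Tesseract config variable; hocr_char_boxes=1 is the workaround.
        cmd += ["-c", "hocr_char_boxes=1"]
    # The trailing "hocr" config selects hOCR output.
    cmd.append("hocr")
    return cmd


cmd = tesseract_hocr_command("210923-005.tif")
printable = shlex.join(cmd)
print(printable)
```

The resulting list can be passed to subprocess.run with stdout redirected to the .hocr file; it is not executed here, so the sketch runs without Tesseract installed.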

MerlijnWajer commented 2 years ago

Oh, I see what the problem is, I only fixed hocr_page_to_word_data_fast in the hocr package, and the pdf package uses hocr_page_to_word_data. I'll fix that as well momentarily.

MerlijnWajer commented 2 years ago

Mind trying with pip install archive-hocr-tools==1.1.13 ?

rmast commented 2 years ago

No difference as far as I can see. Does it matter which A4 picture you use to interpret the hOCR? Should I try to anonymize a picture and see if it still has this issue?

MerlijnWajer commented 2 years ago

Hm, it works for me. Yes, please anonymise an image.

rmast commented 2 years ago

Documents.zip

recode_pdf --from-imagestack 210923-005a.tif --hocr-file img5a.hocr -o outfa.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
     MMX
     SSE
     SSE2
     SSE3
     POPCNT
Creating text only PDF
Starting page generation at 2021-11-28T16:09:09.735919
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.9', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 262, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 1070, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 290, in create_tess_textonly_pdf
    render.AddImageHandler(word_data, width, height, ppi=ppi, hocr_ppi=hocr_dpi)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/pdfrenderer.py", line 418, in AddImageHandler
    pdftext = self.GetPDFTextObjects(word_data, width, height, ppi, hocr_ppi=hocr_ppi)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/pdfrenderer.py", line 100, in GetPDFTextObjects
    for char in word['text']:
TypeError: 'NoneType' object is not iterable

I ran both pip3 install and sudo pip3 install for 1.1.13, and also changed the version in requirements.txt before running python3 setup.py build and install. That is the same procedure I followed for the 1.1.12 version, where it did make a difference.

MerlijnWajer commented 2 years ago

It works ok for me:

$ recode_pdf --from-imagestack /tmp/210923-005a.tif --hocr-file /tmp/img5a.hocr -o outfa.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
     MMX
     SSE
     SSE2
     SSE3
     SSSE3
     SSE41
     POPCNT
     SSE42
     AVX
     F16C
     FMA3
     AVX2
Creating text only PDF
Starting page generation at 2021-11-28T16:21:49.148181
Finished page generation at 2021-11-28T16:21:49.158080
Creating text pages took 0.0099 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 0, 'grey_conversion': 100, 'hocr_mask_gen': 17, 'est_1': 27, 'threshold': 100, 'fast_denoise': 5, 'mask_jbig2': 58, 'fg_partial_blur': 237, 'fg_jp2': 108, 'bg_partial_blur': 235, 'bg_downsample': 82, 'bg_jp2': 42, 'page_image_insertion': 0}
Saving PDF now
Processed 1 pages at 1.13 seconds/page
Compression ratio: 39.787262

It should work with 1.1.13 I think - is that what you are reporting now as well?

MerlijnWajer commented 2 years ago

I will of course have to update the archive-pdf-tools dependency to the new hOCR release; that's why I suggested the one-off pip install command earlier. I'll do that later today, I just want to make sure I'm not breaking anything else. :-)

rmast commented 2 years ago

It might be a dependency issue. I'm quite new at python.
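
One way to check for such a dependency mismatch is to ask the Python interpreter that runs recode_pdf which version of archive-hocr-tools it actually sees. The sketch below uses the standard-library importlib.metadata (available from Python 3.8); the helper names are hypothetical, and the version comparison is deliberately naive and only handles purely numeric versions such as 1.1.13.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(version, minimum):
    """Compare dotted numeric versions such as '1.1.13' (naive sketch)."""
    def as_tuple(value):
        return tuple(int(part) for part in value.split("."))
    return as_tuple(version) >= as_tuple(minimum)


found = installed_version("archive-hocr-tools")
if found is None:
    status = "archive-hocr-tools is not installed for this interpreter"
elif is_at_least(found, "1.1.13"):
    status = f"archive-hocr-tools {found} should include the fix"
else:
    status = f"archive-hocr-tools {found} is older than 1.1.13"
print(status)
```

Running this with both python3 and sudo python3 can reveal that a user-level install and a system-level install carry different versions.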


MerlijnWajer commented 2 years ago

I have just uploaded version 1.4.10 - I think that should fix your problem. Please let me know if you run into other problems or have suggestions / feedback. Thanks!

rmast commented 2 years ago

I completely rebuilt a Mint Cinnamon 20.2 image to get it running. There is a small difference in the compression ratio I reach. With the Gamera binarizer, this kind of behaviour signaled a memory leak; however, this compression ratio is constant on my setup:

recode_pdf --from-imagestack 210923-005a.tif --hocr-file img5a.hocr -o outfa.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
     MMX
     SSE
     SSE2
     SSE3
     POPCNT
Creating text only PDF
Starting page generation at 2021-11-29T22:20:57.639133
Finished page generation at 2021-11-29T22:20:57.798189
Creating text pages took 0.1592 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 356, 'grey_conversion': 38, 'hocr_mask_gen': 69, 'est_1': 332, 'threshold': 421, 'fast_denoise': 13, 'mask_jbig2': 181, 'fg_partial_blur': 918, 'fg_jp2': 994, 'bg_partial_blur': 810, 'bg_downsample': 263, 'bg_jp2': 98, 'page_image_insertion': 4}
Saving PDF now
Processed 1 pages at 5.98 seconds/page
Compression ratio: 39.650476
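
To put the two logs side by side, the following sketch computes the relative difference between the two reported compression ratios and the three largest entries of the MRC time breakdown above. All values are copied from the logs in this thread; the unit of the breakdown values is not stated there, so only their relative shares are computed.

```python
# Compression ratios copied from the two logs in this thread.
ratio_maintainer = 39.787262
ratio_reporter = 39.650476

# Relative difference between the two compression ratios.
rel_diff = (ratio_maintainer - ratio_reporter) / ratio_maintainer

# MRC time breakdown from the log above (unit not stated in the log).
breakdown = {
    'image_load': 356, 'grey_conversion': 38, 'hocr_mask_gen': 69,
    'est_1': 332, 'threshold': 421, 'fast_denoise': 13, 'mask_jbig2': 181,
    'fg_partial_blur': 918, 'fg_jp2': 994, 'bg_partial_blur': 810,
    'bg_downsample': 263, 'bg_jp2': 98, 'page_image_insertion': 4,
}
total = sum(breakdown.values())

# Three most expensive stages, largest first.
top3 = sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(f"compression ratio differs by {rel_diff:.2%}")
for name, value in top3:
    print(f"{name}: {value / total:.0%} of the summed breakdown")
```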

Here is a condensed version of my installation steps on the fresh image:

sudo apt install git
sudo apt-get install build-essential
sudo apt-get install autotools-dev
sudo apt-get install automake
sudo apt-get install libtool
sudo apt-get install libpng-dev
sudo apt install python3-pip
pip3 install Cython
pip3 install numpy
sudo pip3 install numpy
sudo pip3 install Cython

wget https://github.com/internetarchive/archive-pdf-tools/files/7613879/Documents.zip
unzip Documents.zip
wget https://kakadusoftware.com/wp-content/uploads/KDU805_Demo_Apps_for_Linux-x86-64_200602.zip
unzip KDU805_Demo_Apps_for_Linux-x86-64_200602.zip 
sudo ln KDU805_Demo_Apps_for_Linux-x86-64_200602/kdu_expand /home/oem/.local/bin
sudo ln KDU805_Demo_Apps_for_Linux-x86-64_200602/kdu_compress /home/oem/.local/bin
sudo ln KDU805_Demo_Apps_for_Linux-x86-64_200602/libkdu_v80R.so /usr/local/lib
vi .bashrc
# add this line to .bashrc:
export PATH=$PATH:/home/oem/.local/bin
git clone https://github.com/DanBloomberg/leptonica.git
cd leptonica/
git checkout 1.74.4
./autobuild 
./configure
make
sudo make install
cd ..
git clone https://github.com/agl/jbig2enc.git
cd jbig2enc/
./autogen.sh 
./configure
make
sudo make install
cd ..
git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
python3 setup.py build
sudo python3 setup.py install
cd ..