internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

IndexError: list index out of range (single TIFF file) #61

Closed by jrochkind 1 year ago

jrochkind commented 1 year ago
recode_pdf --version
internetarchivepdf 1.5.2

I am trying this tool out for the first time, by trying to take a single TIFF image and a single HOCR file created by tesseract, and combine them into a PDF.

The README example seems to show that you can pass a single TIFF image as the value of the --from-imagestack argument? I don't know if that's the problem.

$ recode_pdf -v --from-imagestack insuring_15.tiff --hocr-file insuring_15.tesseract.hocr -o out.pdf
     NEON
     NEON_FP16
     NEON_VFPV4
     ASIMD
     FPHP
     ASIMDHP
Creating text only PDF
Starting page generation at 2023-03-08T16:43:53.433673
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/recode_pdf", line 299, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 639, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 134, in create_tess_textonly_pdf
    imgfile = image_files[idx]
IndexError: list index out of range

You can find the files I used as input here:

The TIFF is fairly large (60MB), I don't know if that's an issue.

I appreciate any advice!

jrochkind commented 1 year ago

OK, I think there may in fact be no way to run recode_pdf on a single page?

If I use --from-imagestack 'some_dir/*' and that dir has only one image in it, I still get IndexError: list index out of range.

If I use e.g. --from-imagestack /tmp/scan.tiff (an example from the README), I also get IndexError: list index out of range.

If I use --from-imagestack 'some_dir/*' on a directory that has at least two image files in it -- it works.

Is there no way to run on a PDF with only one image? Is this a bug?

MerlijnWajer commented 1 year ago

I believe the problem is that this hOCR file contains two ocr_page elements, which can happen if you run Tesseract on a TIFF file that contains two images - this seems to be the case here. A TIFF with an embedded thumbnail is also seen as two images.

If you tell Tesseract to use only the first image, this problem will go away. Try passing -c tessedit_page_number=0.

The --from-imagestack takes a glob as argument, so it can definitely deal with a single image - the problem here occurs because the hOCR file claims to contain two pages. :-)
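To check for the situation described above before running recode_pdf, one option is to count the ocr_page elements in the hOCR file. The following is a minimal sketch using only the Python standard library; the names OcrPageCounter and count_ocr_pages are hypothetical helpers written for this illustration and are not part of archive-pdf-tools.

```python
from html.parser import HTMLParser


class OcrPageCounter(HTMLParser):
    """Counts elements whose class attribute includes 'ocr_page'.

    Hypothetical helper for illustration; not part of archive-pdf-tools.
    """

    def __init__(self):
        super().__init__()
        self.pages = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.pages += 1


def count_ocr_pages(hocr_text: str) -> int:
    """Return the number of ocr_page elements found in an hOCR document."""
    parser = OcrPageCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.pages
```

If the diagnosis above is right, passing the contents of insuring_15.tesseract.hocr to count_ocr_pages should report more than one page, even though only one image file was supplied.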

MerlijnWajer commented 1 year ago

Using:

tesseract -c tessedit_page_number=0 insuring_15.tiff - hocr > /tmp/test.hocr
recode_pdf -v --from-imagestack insuring_15.tiff --hocr-file /tmp/test.hocr -o out.pdf

out.pdf

jrochkind commented 1 year ago

Oh right, that makes sense!

Thank you for helping me figure this out, and suggesting the tesseract command to only take the first one!

Now that you mention it, I recall us having problems before with these embedded thumbnails that our production process winds up embedding for reasons we don't really know. But I had forgotten about that, and hadn't noticed the double page in the HOCR.

It seems like it might be better to get a clearer error message like "number of pages in HOCR does not match number of images provided" -- but this might not be the highest priority in archive-pdf-tools.

I will close this issue. Thanks so much for your help, and for providing this code!
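The kind of check suggested above could look roughly like the sketch below. It is only an illustration of the idea: the function name check_page_counts and its parameters are assumptions made for this example, and it does not reproduce the code that was actually committed to archive-pdf-tools.

```python
def check_page_counts(hocr_page_count: int, image_files: list) -> None:
    """Raise a descriptive error if hOCR pages and images do not line up.

    Hypothetical helper for illustration; hocr_page_count is assumed to be
    the number of ocr_page elements already counted in the hOCR file.
    """
    if hocr_page_count != len(image_files):
        raise ValueError(
            f"number of pages in hOCR ({hocr_page_count}) does not match "
            f"number of images provided ({len(image_files)})"
        )
```

With one image and an hOCR file that claims two pages, this would fail early with a readable message instead of an IndexError deep inside page generation.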

MerlijnWajer commented 1 year ago

Thanks for the suggestion, I have committed said error message in commit 11661f2. (Untested)