Closed jrochkind closed 1 year ago
OK, I think there may in fact be no way to run recode_pdf on a single page?
If I use --from-imagestack 'some_dir/*' and that dir only has one image in it, I still get IndexError: list index out of range
If I use e.g. --from-imagestack /tmp/scan.tiff (an example from the README), I also get IndexError: list index out of range.
If I use --from-imagestack 'some_dir/*' on a directory that has at least two image files in it, it works.
Is there no way to run on a PDF with only one image? Is this a bug?
I believe the problem is that this hOCR file contains two ocr_page elements, which can happen if you run Tesseract on a TIFF file that contains two images; this seems to be the case here. A TIFF with an embedded thumbnail is also seen as two images.
If you tell Tesseract to use only the first image, this problem will go away. Try passing -c tessedit_page_number=0.
The --from-imagestack option takes a glob as its argument, so it can definitely deal with a single image. The problem here occurs because the hOCR file claims to contain two pages. :-)
Using:
tesseract -c tessedit_page_number=0 insuring_15.tiff - hocr > /tmp/test.hocr
recode_pdf -v --from-imagestack insuring_15.tiff --hocr-file /tmp/test.hocr -o out.pdf
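To confirm the diagnosis before re-running Tesseract, it can help to count how many ocr_page elements an hOCR file actually declares. The following is a minimal sketch using only the Python standard library. The names OcrPageCounter and count_hocr_pages are made up for illustration and are not part of archive-pdf-tools.

```python
from html.parser import HTMLParser


class OcrPageCounter(HTMLParser):
    """Counts start tags whose class attribute includes 'ocr_page'."""

    def __init__(self):
        super().__init__()
        self.pages = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.pages += 1


def count_hocr_pages(hocr_text):
    """Return the number of ocr_page elements found in an hOCR string."""
    parser = OcrPageCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.pages
```

Reading /tmp/test.hocr into a string and passing it to count_hocr_pages should give 1 for a single-image run. A result of 2 for a single TIFF would point to a second embedded image, such as a thumbnail.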
Oh right, that makes sense!
Thank you for helping me figure this out, and suggesting the tesseract command to only take the first one!
Now that you mention it, I recall us having problems before with these thumbnails that our production process winds up embedding for reasons we don't really know. But I had forgotten about that, and hadn't noticed the double page in the hOCR.
It seems like it might be better to get a clearer error message like "number of pages in HOCR does not match number of images provided", but this might not be the highest priority in archive-pdf-tools.
I will close this issue. Thanks so much for your help, and for providing this code!
Thanks for the suggestion, I have committed said error message in commit 11661f2. (Untested)
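For readers curious what such a check could look like, here is a rough sketch. It is not the code from commit 11661f2, which is not shown in this thread. The function name check_page_counts and its parameters are hypothetical, and the hOCR page count is assumed to have been computed separately.

```python
import glob


def check_page_counts(image_glob, hocr_page_count):
    """Expand the image glob and verify it matches the hOCR page count.

    Hypothetical helper for illustration only; returns the sorted list
    of matched image paths, or raises ValueError on a mismatch.
    """
    images = sorted(glob.glob(image_glob))
    if len(images) != hocr_page_count:
        raise ValueError(
            f"number of pages in hOCR ({hocr_page_count}) does not match "
            f"number of images provided ({len(images)})"
        )
    return images
```

Failing early like this would replace the bare IndexError with a message that names both counts, which is usually enough to spot a stray second page.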
I am trying this tool out for the first time, by trying to take a single TIFF image and a single HOCR file created by tesseract, and combine them into a PDF.
The README example seems to show you can pass a single TIFF image as the value of the --from-imagestack argument? I don't know if that's the problem.
You can find the files I used as input here:
The TIFF is fairly large (60MB), I don't know if that's an issue.
I appreciate any advice!