internetarchive / archive-hocr-tools

Efficient hOCR tooling

EPUB: option to continue past singular Kakadu errors #17

Closed scottbarnes closed 2 months ago

scottbarnes commented 2 months ago

When the --ignore-broken-images option is provided, this commit lets EPUB generation continue past individual images that Kakadu cannot convert.

In both of the cases this PR directly addresses, the affected images would not open in GIMP and do not display in BookReader. They appear to be corrupt, and I have not found a way to salvage them.

See, e.g., page 12 here: https://archive.org/details/cu31924003577214/page/n11/mode/2up

Sample error:

Traceback (most recent call last):
  File "/usr/local/bin/hocr-to-epub", line 684, in <module>
    EpubGenerator(args.infile, args.metafile, args.imagestack, args.scandata, args.outfile, use_kakadu=args.kakadu)
  File "/usr/local/bin/hocr-to-epub", line 294, in __init__
    self.generate()
  File "/usr/local/bin/hocr-to-epub", line 602, in generate
    cropped_image_filename = self.img_stack.crop_image(page_idx, box)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/hocr-to-epub", line 116, in crop_image
    raise RuntimeError(
RuntimeError: Can't convert JP2 to TIFF: Command '['kdu_expand', '-num_threads', '1', '-i', '/item/temp.jp2', '-o', '/item/page_12.tiff']' returned non-zero exit status 255.
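
The traceback shows crop_image raising a RuntimeError when kdu_expand exits non-zero. As a rough illustration of the opt-in behaviour described above (not the code in this PR), the call could be wrapped so that the error is re-raised by default and only skipped when the flag is set. The helper name crop_or_skip and the ignore_broken_images parameter are hypothetical, and crop_image is passed in as a plain callable so the sketch runs without Kakadu installed.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def crop_or_skip(
    crop_image: Callable[[int, Sequence[int]], str],
    page_idx: int,
    box: Sequence[int],
    ignore_broken_images: bool = False,
) -> Optional[str]:
    """Return the cropped image filename, or None if the image is skipped.

    `crop_image` stands in for a method such as img_stack.crop_image from the
    traceback, assumed to raise RuntimeError when the JP2 conversion fails.
    """
    try:
        return crop_image(page_idx, box)
    except RuntimeError as exc:
        if not ignore_broken_images:
            # Default behaviour: propagate the error and abort generation.
            raise
        # Opt-in behaviour: log the failure and let the caller carry on.
        logger.warning("Skipping broken image on page %d: %s", page_idx, exc)
        return None


if __name__ == "__main__":
    def broken_crop(page_idx: int, box: Sequence[int]) -> str:
        # Simulates the failure seen in the traceback.
        raise RuntimeError("Can't convert JP2 to TIFF")

    result = crop_or_skip(broken_crop, 12, (0, 0, 10, 10), ignore_broken_images=True)
    print(result)
```

A caller would need to treat a None result as "no image for this page" and omit the image from the EPUB output; how the PR itself handles that case is not shown here.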

Underlying errors