internetarchive / archive-hocr-tools

Efficient hOCR tooling

EPUB: convert images to 8 bit prior to saving as JPEG #16

Closed scottbarnes closed 2 months ago

scottbarnes commented 2 months ago

This commit fixes a pair of errors related to saving images as JPEG. In the first case, a 16-bit TIFF raises an OSError when saved as JPEG, owing to JPEG's 8-bit limitation.

In the second case, the image is palette-based (mode P) and cannot be saved as JPEG.

In both cases the fix is to convert the image to RGB so the JPEG can be written.

The solution here is more general: grayscale images with an alpha channel have the alpha channel removed, and anything that isn't grayscale or RGB is converted to RGB.

RGB images that are functionally grayscale (the same value on every channel) are also converted to grayscale.
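
For readers who want to see the rules above in one place, here is a minimal sketch of that mode normalization using Pillow. It is not the code from this commit: the helper name `prepare_for_jpeg` is hypothetical, and the channel comparison via `ImageChops.difference` is just one assumed way to detect a functionally grayscale RGB image. How 16-bit sample values are mapped down to 8 bits is left to Pillow's `convert()` and is not addressed here.

```python
from io import BytesIO

from PIL import Image, ImageChops


def prepare_for_jpeg(img: Image.Image) -> Image.Image:
    """Hypothetical helper: return an image in mode "L" or "RGB".

    Mirrors the rules in the description: drop alpha from grayscale,
    convert everything that is not grayscale or RGB to RGB, and
    collapse RGB images whose channels are identical to grayscale.
    """
    if img.mode == "L":
        return img
    if img.mode == "LA":
        # Grayscale with alpha: discard the alpha channel.
        return img.convert("L")
    if img.mode != "RGB":
        # Covers palette ("P"), 16-bit ("I;16"), RGBA, CMYK, and so on.
        img = img.convert("RGB")

    # Assumed grayscale check: all three channels are pixel-identical.
    r, g, b = img.split()
    same_rg = ImageChops.difference(r, g).getbbox() is None
    same_gb = ImageChops.difference(g, b).getbbox() is None
    if same_rg and same_gb:
        return r  # a single band is already a mode "L" image
    return img


if __name__ == "__main__":
    # A palette image whose only used color is pure red.
    palette_img = Image.new("P", (4, 4))
    palette_img.putpalette([255, 0, 0] + [0, 0, 0] * 255)

    out = prepare_for_jpeg(palette_img)
    out.save(BytesIO(), format="JPEG")  # no longer raises OSError
    print(palette_img.mode, "->", out.mode)
```

In `crop_image`, a helper like this would be applied to `region` just before `region.save(output_filename)`.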

The 16-bit error is as follows:

Traceback (most recent call last):
  File "/usr/local/bin/hocr-to-epub", line 672, in <module>
    EpubGenerator(args.infile, args.metafile, args.imagestack, args.scandata, args.outfile, use_kakadu=args.kakadu)
  File "/usr/local/bin/hocr-to-epub", line 291, in __init__
    self.generate()
  File "/usr/local/bin/hocr-to-epub", line 590, in generate
    cropped_image_filename = self.img_stack.crop_image(page_idx, box)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/hocr-to-epub", line 132, in crop_image
    region.save(output_filename)
  File "/usr/local/lib/python3.12/site-packages/PIL/Image.py", line 2568, in save
    save_handler(self, fp, filename)
  File "/usr/local/lib/python3.12/site-packages/PIL/JpegImagePlugin.py", line 642, in _save
    raise OSError(msg) from e
OSError: cannot write mode I;16 as JPEG

Sample item: https://archive.org/download/ColibriesdeMexi00Ariz

The palette error is as follows:

Traceback (most recent call last):
  File "/usr/local/bin/hocr-to-epub", line 684, in <module>
    EpubGenerator(args.infile, args.metafile, args.imagestack, args.scandata, args.outfile, use_kakadu=args.kakadu)
  File "/usr/local/bin/hocr-to-epub", line 294, in __init__
    self.generate()
  File "/usr/local/bin/hocr-to-epub", line 602, in generate
    cropped_image_filename = self.img_stack.crop_image(page_idx, box)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/hocr-to-epub", line 135, in crop_image
    region.save(output_filename)
  File "/usr/local/lib/python3.12/site-packages/PIL/Image.py", line 2568, in save
    save_handler(self, fp, filename)
  File "/usr/local/lib/python3.12/site-packages/PIL/JpegImagePlugin.py", line 642, in _save
    raise OSError(msg) from e
OSError: cannot write mode P as JPEG

Sample item: https://archive.org/download/jiltandothersto00readgoog
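
Both failures can likely be reproduced without downloading the sample items. The sketch below assumes a recent Pillow and uses a hypothetical helper, `jpeg_save_error`, that tries to encode a blank image of a given mode as JPEG in memory and returns the error text, if any.

```python
from io import BytesIO
from typing import Optional

from PIL import Image


def jpeg_save_error(mode: str) -> Optional[str]:
    """Hypothetical helper: return the OSError message raised when a
    blank image of `mode` is saved as JPEG, or None if saving works."""
    img = Image.new(mode, (8, 8))
    try:
        img.save(BytesIO(), format="JPEG")
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # "I;16" and "P" are the two modes from the tracebacks above;
    # "L" and "RGB" are included as modes JPEG accepts.
    for mode in ("I;16", "P", "L", "RGB"):
        print(mode, "->", jpeg_save_error(mode))
```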