maxpmaxp / pdfreader

Python API for PDF documents
MIT License

ValueError: Unexpected predictor 1 #109

Closed. Vincent-Stragier closed 4 months ago

Vincent-Stragier commented 1 year ago

Hello,

I get the following error when trying to extract some images. I installed the version from this repository, but I got the same error with the latest PyPI release:

py -3.10 .\extract_images.py .\llama2.pdf .\images\llama2
Traceback (most recent call last):
  File "C:\Users\Vincent\Downloads\extract_all_images\extract_images.py", line 28, in images_from_viewer
    images.append(page_image.to_Pillow())
  File "C:\Users\Vincent\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfreader\pillow.py", line 88, in to_Pillow
    img = Image.frombytes(cs, size, bytes(self.filtered))
  File "C:\Users\Vincent\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfreader\utils.py", line 22, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "C:\Users\Vincent\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfreader\types\native.py", line 112, in filtered
    return apply_filter_multi(self.get('Filter'),
  File "C:\Users\Vincent\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfreader\types\native.py", line 56, in apply_filter_multi
    binary = apply_filter(fname, binary, params)
  File "C:\Users\Vincent\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfreader\filters\__init__.py", line 14, in apply_filter
    return decoder.decode(binary, params or {})
  File "C:\Users\Vincent\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfreader\filters\flate.py", line 23, in decode
    data = _remove_predictors(data, params.get("Predictor"), params.get("Columns"))
  File "C:\Users\Vincent\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfreader\filters\predictors.py", line 24, in _remove_predictors
    raise ValueError("Unexpected predictor {}".format(data[0]))
ValueError: Unexpected predictor 1
Failed to extract image on page 4

The code I used is below. The PDF is the paper passed on the command line above (llama2.pdf):

"""Extract all images from a PDF file."""
import argparse
import os
import traceback

from pdfreader import SimplePDFViewer

def images_from_viewer(viewer) -> list:
    """Return all images from a PDF viewer.

    Args:
        viewer (SimplePDFViewer): A PDF viewer.

    Returns:
        list: A list of images.
    """
    images = []
    page_count = len(list(viewer.doc.pages()))

    for index, canvas in enumerate(viewer):
        print(f"On page {index + 1}/{page_count}", end="\r")
        page_images = canvas.images
        # print(f'Found {len(page_images)} images on page {index + 1}')

        for page_image in page_images.values():
            try:
                images.append(page_image.to_Pillow())
            except ValueError:
                traceback.print_exc()
                print(f"Failed to extract image on page {index + 1}")

    print()

    return images

def save_images(images: list, path: str) -> None:
    """Save images to a path.

    Args:
        images (list): A list of images.
        path (str): A path to save images to.
    """
    for index, image in enumerate(images):
        image.save(f"{path}_{index}.png", format="png")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)

    parser.add_argument("pdf_path", help="Path to PDF file")
    parser.add_argument("image_path", help="Path to save images to")

    args = parser.parse_args()

    pdf_path = args.pdf_path
    image_path = args.image_path

    # Ensure that the parent directory of the image path exists and create
    # it if it doesn't (dirname is empty when the prefix has no directory part)
    parent_dir = os.path.dirname(image_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)

    with open(pdf_path, "rb") as file:
        simple_viewer = SimplePDFViewer(file)
        extracted_images = images_from_viewer(simple_viewer)
        save_images(extracted_images, image_path)

maxpmaxp commented 4 months ago

@Vincent-Stragier fixed in #116, merged into master
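
For readers who hit the same traceback: with FlateDecode and a PNG predictor (Predictor 10 to 15), every row of the decompressed stream starts with a tag byte naming the PNG filter used for that row (0 None, 1 Sub, 2 Up, 3 Average, 4 Paeth). The "1" in the error message is such a tag (Sub), read from data[0]. The sketch below shows how all five tags can be undone. It is an illustration only, not the patch from #116 and not pdfreader's code; the names remove_png_predictor and paeth and the parameter defaults are assumptions made for this example.

```python
def paeth(a: int, b: int, c: int) -> int:
    """Paeth predictor: pick whichever of left (a), up (b), upper-left (c)
    is closest to a + b - c, as defined by the PNG specification."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def remove_png_predictor(data: bytes, columns: int, colors: int = 1,
                         bits_per_component: int = 8) -> bytes:
    """Undo per-row PNG filtering on already-decompressed stream data.

    Hypothetical helper: `columns`, `colors` and `bits_per_component`
    mirror the DecodeParms entries Columns, Colors and BitsPerComponent.
    """
    bpp = (colors * bits_per_component + 7) // 8            # bytes per pixel, at least 1
    row_len = (columns * colors * bits_per_component + 7) // 8
    stride = row_len + 1                                     # one tag byte per row
    if len(data) % stride:
        raise ValueError("data length is not a multiple of the row size")

    out = bytearray()
    prev = bytearray(row_len)                                # row above the first is all zeros
    for start in range(0, len(data), stride):
        tag = data[start]
        row = bytearray(data[start + 1:start + stride])
        for i in range(row_len):
            a = row[i - bpp] if i >= bpp else 0              # left neighbour, already decoded
            b = prev[i]                                      # byte above
            c = prev[i - bpp] if i >= bpp else 0             # upper-left neighbour
            if tag == 0:
                pred = 0
            elif tag == 1:
                pred = a
            elif tag == 2:
                pred = b
            elif tag == 3:
                pred = (a + b) // 2
            elif tag == 4:
                pred = paeth(a, b, c)
            else:
                raise ValueError(f"Unexpected predictor {tag}")
            row[i] = (row[i] + pred) & 0xFF
        out += row
        prev = row
    return bytes(out)
```

For example, remove_png_predictor(bytes([1, 10, 10, 10, 2, 1, 2, 3]), columns=3) decodes a Sub row followed by an Up row into the raw bytes 10, 20, 30, 11, 22, 33.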