jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

.to_image() treats a stream as a regular file #948

Closed: Urbener closed this issue 1 year ago

Urbener commented 1 year ago

When working with a ZipExtFile, calling page.to_image() raises a FileNotFoundError because it treats the file name inside the zip archive as a regular, filesystem-backed file. I think this applies to other stream types as well, but I haven't been able to test that.

Calling pypdfium2.PdfDocument._process_page() directly works as expected, so I think the problem can be traced here.

However, calling pdfplumber.open() with repair=True does work with some files, and the type of the object received by get_page_image() changes: with repair=True it gets an _io.BytesIO object, while without it, it gets a ZipExtFile.
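To illustrate why the two stream types might be handled differently, here is a small standard-library sketch. A ZipExtFile carries a name attribute holding the archive member's name, which is not a path on disk, while a BytesIO has no name attribute at all. The helper describe_stream is hypothetical and only for illustration; it is not part of pdfplumber or pypdfium2, and it does not claim to reproduce the library's actual check.

```python
import io
import os
from zipfile import ZipFile


def describe_stream(stream):
    """Hypothetical helper: report the stream's type and whether its
    `name` attribute (if any) points at a real file on disk."""
    name = getattr(stream, "name", None)
    return {
        "type": type(stream).__name__,
        "name": name,
        "name_exists_on_disk": isinstance(name, str) and os.path.exists(name),
    }


def demo():
    # Build a tiny zip archive in memory so the demo needs no files on disk.
    archive = io.BytesIO()
    with ZipFile(archive, "w") as zip_file:
        zip_file.writestr("dummy.pdf", b"placeholder bytes, not a real PDF")

    with ZipFile(archive) as zip_file:
        with zip_file.open("dummy.pdf") as member:
            # The member has a name, but that name is only valid inside the zip.
            print(describe_stream(member))

    # A BytesIO has no `name` attribute, so nothing can mistake it for a path.
    print(describe_stream(io.BytesIO(b"placeholder")))


if __name__ == "__main__":
    demo()
```

Any code that decides "this is a file on disk" purely from the presence of a name attribute would misclassify the ZipExtFile, which matches the FileNotFoundError described above.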

# reproducer.py
from zipfile import ZipFile

import pdfplumber

with ZipFile('reproducer.zip') as zip_file:
    with zip_file.open('dummy.pdf') as pdf_file:
        with pdfplumber.open(pdf_file) as pdf:
            page = pdf.pages[0]
            im = page.to_image()

Sample ZIP and PDF file: reproducer.zip
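For anyone stuck on an affected version, a possible workaround (a sketch based on the observation above that a BytesIO object is handled correctly, not something I have tested across versions) is to read the zip member fully into memory and hand pdfplumber a BytesIO instead of the ZipExtFile. The helper names read_member_into_memory and render_first_page are hypothetical; only pdfplumber.open(), pdf.pages and page.to_image() come from the reproducer.

```python
import io
from zipfile import ZipFile


def read_member_into_memory(zip_path, member_name):
    """Hypothetical helper: return the zip member's bytes wrapped in a
    seekable BytesIO, which has no `name` attribute."""
    with ZipFile(zip_path) as zip_file:
        return io.BytesIO(zip_file.read(member_name))


def render_first_page(zip_path, member_name):
    """Hypothetical helper: open the buffered PDF and render page 1."""
    # Imported lazily so the buffering helper works without pdfplumber.
    import pdfplumber

    buffer = read_member_into_memory(zip_path, member_name)
    with pdfplumber.open(buffer) as pdf:
        # Render while the PDF is still open.
        return pdf.pages[0].to_image()
```

With the sample archive, this would be called as render_first_page('reproducer.zip', 'dummy.pdf'). The trade-off is that the whole PDF is held in memory, which may matter for very large files.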

Environment:

jsvine commented 1 year ago

Hi @Urbener, and thank you for flagging. Thanks, too, for the clear description, example file, and code to reproduce. Exactly the kind of issue I like to see!

I agree with your diagnosis of the issue / code to blame. I have some potential solutions in mind, which I'll test. Will keep you updated here.

jsvine commented 1 year ago

Thanks again! This should be fixed in https://github.com/jsvine/pdfplumber/commit/30a52cb, and the fix is now available in v0.10.2.

Urbener commented 1 year ago

It's my pleasure to help improve such a useful project. I should be the one thanking you for creating it and for your continued support.

Thanks!