Closed Urbener closed 1 year ago
Hi @Urbener, and thank you for flagging. Thanks, too, for the clear description, example file, and code to reproduce. Exactly the kind of issue I like to see!
I agree with your diagnosis of the issue / code to blame. I have some potential solutions in mind, which I'll test. Will keep you updated here.
Thanks again! This should be fixed in https://github.com/jsvine/pdfplumber/commit/30a52cb and is now available in v0.10.2.
It's my pleasure to help improve such a useful project. It should be me thanking you for creating it and for your continued support.
Thanks!
When working with a `ZipExtFile`, calling `page.to_image()` ends up throwing a `FileNotFoundError`, as it's treating the file name inside the zip file as a regular, filesystem-backed file. I think this will apply to other stream types as well, but I haven't been able to test it.
Calling `pypdfium2.PdfDocument._process_page()` directly works as expected, so I think the problem can be traced here.
However, calling `pdfplumber.open(repair=True)` does work with some files, and the type received by `get_page_image()` changes: with `repair=True`, it gets a `_io.BytesIO` object, while without it, it gets a `ZipExtFile`.
Sample ZIP and PDF file: reproducer.zip
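For pdfplumber versions without the fix, one possible workaround is to copy the ZIP member into memory first, since the `repair=True` path described above works with a `BytesIO` object. The sketch below uses only the standard library and has not been run against the attached sample file. The helper names `read_member_into_memory` and `build_demo_zip`, the member name `inner/sample.pdf`, and the placeholder payload are assumptions made for illustration. It also shows that a `ZipExtFile` reports the path inside the archive as its `name`, which matches the diagnosis above.

```python
import io
import zipfile


def read_member_into_memory(archive, member_name):
    """Copy one member of an open ZipFile into an in-memory buffer."""
    with archive.open(member_name) as member:  # member is a ZipExtFile
        return io.BytesIO(member.read())


def build_demo_zip(member_name, payload):
    """Hypothetical helper: build a small ZIP in memory for the demo."""
    raw = io.BytesIO()
    with zipfile.ZipFile(raw, "w") as archive:
        archive.writestr(member_name, payload)
    raw.seek(0)
    return raw


# Placeholder bytes stand in for a real PDF (assumption for illustration).
demo = build_demo_zip("inner/sample.pdf", b"%PDF-1.4 placeholder bytes")

with zipfile.ZipFile(demo) as archive:
    with archive.open("inner/sample.pdf") as member:
        # The stream's name is the path inside the archive, not a file on disk.
        member_name_attr = member.name
    buffer = read_member_into_memory(archive, "inner/sample.pdf")

# A BytesIO object carries no file name, so nothing can mistake it for a path.
buffer_has_name = hasattr(buffer, "name")
```

The resulting `buffer` could then be passed to `pdfplumber.open(buffer)`, which accepts file-like objects, before calling `page.to_image()`. This trades memory for compatibility, so it may not suit very large PDFs.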
Environment: