Closed alexaryn closed 3 weeks ago
Thanks for the report. Feel free to submit a corresponding PR and test.
I do not agree with your comment saying that the stream should not be closed : once you are calling write()
from the writer point of view the stream has nothing more to do.
if you wan to keep it open, this should work:
def select_pdf_pages(input: BinaryIO, out: BinaryIO, page_list: list[int]) -> None:
input.seek(0)
with pypdf.PdfReader(input) as pdf_reader:
pdf_writer = pypdf.PdfWriter()
for page_num in page_list:
pdf_writer.add_page(pdf_reader.pages[page_num - 1])
pdf_writer.write(out)
@pubpub-zz I think there are two arguments against keeping the current behavior.
pdf_writer.write(out)
should be consistent in its effect without looking back to see how pdf_writer
was instantiated.pdf_writer.close()
. As a rule, context managers clean up just the object instantiated in the with
statement.I'd also counter the supposition that the stream is useless after write()
. It's up to the caller to decide if it has further uses for the stream. One common pattern in the unix world is to hold open a file that has been unlinked, useful for auto-cleaning temporary files.
I agree with @alexaryn in that responsibility of closing a stream should rest purely on the caller of PdfWriter
, and not PdfWriter
itself. Calling pdf_writer.close()
or using the context manager should only clean up the resources that PdfWriter
itself has created as part of its operation. I apologize for not catching the current behavior when I reviewed #1193. 😬
Looking at the code, there is also the inefficiency where PdfWriter.__init__
is called twice. It's also permissible to call PdfWriter
with a PdfReader
as the first argument (to simplify cloning), but in doing so with contexts, will hit the following error:
>>> with PdfReader("./sample-files/001-trivial/minimal-document.pdf") as reader:
... with PdfWriter(reader) as writer:
... print(writer.fileobj)
...
__init__
__enter__
__init__
<pypdf._reader.PdfReader object at 0x10304e970>
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 373, in __exit__
self.write(self.fileobj)
File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 1396, in write
self.write_stream(stream)
File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 1367, in write_stream
object_positions, free_objects = self._write_pdf_structure(stream)
File "/Users/mpeveler/code/github/pypdf/pypdf/_writer.py", line 1500, in _write_pdf_structure
stream.write(self.pdf_header.encode() + b"\n")
AttributeError: 'PdfReader' object has no attribute 'write'
I'm not sure what the correct behavior here should be, especially with regards to cloning. Like, if I call:
with PdfWriter("foo.pdf") as writer:
and foo.pdf
exists, do I expect that it'll be cloned into the writer? I'd think no? However, I think this does force the second __init__
call to be made as there's no way to tell in the first __init__
that we're in a context. I'd think this could be confusing for end users, but I guess not since no one has written in about this behavior.
@MasterOdin / @alexaryn, I've created a new issue with the @MasterOdin's last post as there is some issues there
Calling
PdfWriter.write(fileobj)
unexpectedly closedfileobj
causing my program to crash later when it tried to dofileobj.seek(0)
.Environment
Which environment were you using when you encountered the problem?
Linux amd64; Python 3.11 and 3.12, but it doesn't matter, as it's a logic bug in the code.
Code
This code will trigger the issue with any PDF input:
After calling this,
out
is closed. It should not be.Traceback
n/a
Explanation
The problem arises here: https://github.com/py-pdf/pypdf/blob/dd3999251bc9a91d18b87810f6a30df956c797c9/pypdf/_writer.py#L1396 which seems like a special case for when
write()
is called from__exit__()
. However, just because thePdfWriter
was used in awith
...as
statement doesn't mean that the stream passed intowrite()
is the internalself.fileobj
. If it belongs to the caller, it should not be messed with.Potential Solution
Rather than closing in
write()
, why not close afterward in__exit__()
?Partial Workaround
write_stream()
does not have the problematic logic.