Closed dhdaines closed 4 months ago
In theory there is a parameter autoclose=True
that should make pypdfium2
close the file (why this isn't the default, only Google knows...) but it doesn't really seem to work. Calling close
explicitly on the PdfDocument
does work, though. PR coming shortly.
Fixed by @dhdaines in https://github.com/jsvine/pdfplumber/pull/1090
In theory there is a parameter autoclose=True that should make pypdfium2 close the file (why this isn't the default, only Google knows...) but it doesn't really seem to work. Calling close explicitly on the PdfDocument does work, though.
Let me shine some light on this:
autoclose=True
is not default because we don't know whether the caller might want to continue using the FD after having closed the PdfDocument
.[^1]with
or try/finally
statement) -- then closing is deferred to garbage collection, which does not have to happen immediately after the object falls unreferenced. There can be an arbitrary delay until the finalizer is actually called. (I'm not aware of a way to prioritize garbage, but feel free to give me a pointer if you find some Python C API to do this.)autoclose
does not relieve you from the task to add an explicit close call; it just merges FD closing into PdfDocument
closing.Maybe we should just remove the parameter with the next major version to avoid confusion? WDYT?
[^1]: I believe it's a convention that an FD be managed by the caller that opened it, and that receivers should not close unless asked for.
- The reason why you perceived it "doesn't really work" is basically the same as when you open a file descriptor without explicitly closing it (e.g. via a
with
ortry/finally
statement) -- then closing is deferred to garbage collection, which does not have to happen immediately after the object falls unreferenced. There can be an arbitrary delay until the finalizer is actually called. (I'm not aware of a way to prioritize garbage, but feel free to give me a pointer if you find some Python C API for this.)- So
autoclose
does not relieve you from the task to add an explicit close call; it just merges FD closing intoPdfDocument
closing.Maybe we should just remove the parameter with the next major version to avoid confusion? WDYT?
Thanks for clarifying this - I feel silly for not remembering the bit about garbage collection, that's the reason why context managers exist after all, sorry about that!
So yes, autoclose
should ideally be replaced with __enter__
and __exit__
methods on PdfDocument
, I think.
No problem.
So yes, autoclose should ideally be replaced with enter and exit methods on PdfDocument, I think.
I guess you're right we should provide context manager functionality.
Though that doesn't conflict with autoclose -- in fact, I believe it could help merge two with
nesting layers into one.
... but maybe the current autoclose needs a better name.
Yes, context manager functionality would be great! pdfplumber
will still have to explicitly call close
, obviously (it really isn't your bug).
I believe it could help merge two with nesting layers into one.
Wait, I just remembered there are 2-in-1 with-blocks, of course:
>>> import pypdfium2 as pdfium
>>>
>>> class PdfWithCtx (pdfium.PdfDocument):
... def __enter__(self):
... return self
... def __exit__(self, *_):
... self.close()
...
>>> with open(".../doc.pdf", "rb") as fh, PdfWithCtx(fh) as pdf:
... print(pdf[0].get_textpage().get_text_bounded())
So I guess autoclose is more or less obsolete. Thinking about it a second time, python should also close the buffer on garbage collection on its own, so supposedly there's no real advantage in binding it into our finalizer.
I think I might remove the parameter with the next major version (if that ever happens 😅).
Describe the bug
When extracting images from a PDF file with a large number of pages, we eventually run out of file descriptors on Mac. This probably leaks file descriptors everywhere but Linux has less strict ulimits. It may also be related to #1072 as Windows has silly restrictions on opening the same file multiple times.
This is because
pypdfium2
is holding onto a file descriptor somewhere. The exception thrown by the code below:Code to reproduce the problem
PDF file
https://ville.sainte-adele.qc.ca/upload/documents/20231213-Codification-administrative-Rgl-1314-2021-Z.pdf (485 pages, enough to hit the ulimit on a Mac!)
Environment
develop
branch (commit https://github.com/jsvine/pdfplumber/commit/07d9997ee587723c32e2178be65eea584102bf58)