jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
5.99k stars 618 forks source link

`Page.to_image()` leaks file descriptors #1089

Closed dhdaines closed 4 months ago

dhdaines commented 5 months ago

Describe the bug

When extracting images from a PDF file with a large number of pages, we eventually run out of file descriptors on Mac. This probably leaks file descriptors everywhere but Linux has less strict ulimits. It may also be related to #1072 as Windows has silly restrictions on opening the same file multiple times.

This is because pypdfium2 is holding onto a file descriptor somewhere. The exception thrown by the code below:

Traceback (most recent call last):
  File "/Users/dhd/work/vdsa/alexi/download/pdfbug.py", line 5, in <module>
    img = page.to_image()
          ^^^^^^^^^^^^^^^
  File "/Users/dhd/work/vdsa/pdfplumber/pdfplumber/page.py", line 585, in to_image
    return PageImage(
           ^^^^^^^^^^
  File "/Users/dhd/work/vdsa/pdfplumber/pdfplumber/display.py", line 88, in __init__
    self.original = get_page_image(
                    ^^^^^^^^^^^^^^^
  File "/Users/dhd/work/vdsa/pdfplumber/pdfplumber/display.py", line 56, in get_page_image
    pdfium_page = pypdfium2.PdfDocument(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhd/work/vdsa/venv/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 62, in __init__
    input = input.expanduser().resolve()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 1240, in resolve
    s = self._flavour.realpath(self, strict=strict)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 436, in realpath
  File "<frozen posixpath>", line 423, in abspath
OSError: [Errno 24] Too many open files

Code to reproduce the problem

import pdfplumber
with pdfplumber.open("20231213-Codification-administrative-Rgl-1314-2021-Z.pdf") as pdf:
    for page in pdf.pages:
        img = page.to_image()

PDF file

https://ville.sainte-adele.qc.ca/upload/documents/20231213-Codification-administrative-Rgl-1314-2021-Z.pdf (485 pages, enough to hit the ulimit on a Mac!)

Environment

dhdaines commented 5 months ago

In theory there is a parameter autoclose=True that should make pypdfium2 close the file (why this isn't the default, only Google knows...) but it doesn't really seem to work. Calling close explicitly on the PdfDocument does work, though. PR coming shortly.

jsvine commented 4 months ago

Fixed by @dhdaines in https://github.com/jsvine/pdfplumber/pull/1090

mara004 commented 3 months ago

In theory there is a parameter autoclose=True that should make pypdfium2 close the file (why this isn't the default, only Google knows...) but it doesn't really seem to work. Calling close explicitly on the PdfDocument does work, though.

Let me shine some light on this:

Maybe we should just remove the parameter with the next major version to avoid confusion? WDYT?

[^1]: I believe it's a convention that an FD be managed by the caller that opened it, and that receivers should not close unless asked for.

dhdaines commented 3 months ago
  • The reason why you perceived it "doesn't really work" is basically the same as when you open a file descriptor without explicitly closing it (e.g. via a with or try/finally statement) -- then closing is deferred to garbage collection, which does not have to happen immediately after the object falls unreferenced. There can be an arbitrary delay until the finalizer is actually called. (I'm not aware of a way to prioritize garbage, but feel free to give me a pointer if you find some Python C API for this.)
  • So autoclose does not relieve you from the task to add an explicit close call; it just merges FD closing into PdfDocument closing.

Maybe we should just remove the parameter with the next major version to avoid confusion? WDYT?

Thanks for clarifying this - I feel silly for not remembering the bit about garbage collection, that's the reason why context managers exist after all, sorry about that!

So yes, autoclose should ideally be replaced with __enter__ and __exit__ methods on PdfDocument, I think.

mara004 commented 3 months ago

No problem.

So yes, autoclose should ideally be replaced with enter and exit methods on PdfDocument, I think.

I guess you're right we should provide context manager functionality. Though that doesn't conflict with autoclose -- in fact, I believe it could help merge two with nesting layers into one. ... but maybe the current autoclose needs a better name.

dhdaines commented 3 months ago

Yes, context manager functionality would be great! pdfplumber will still have to explicitly call close, obviously (it really isn't your bug).

mara004 commented 3 months ago

I believe it could help merge two with nesting layers into one.

Wait, I just remembered there are 2-in-1 with-blocks, of course:

>>> import pypdfium2 as pdfium
>>>
>>> class PdfWithCtx (pdfium.PdfDocument):
...     def __enter__(self):
...             return self
...     def __exit__(self, *_):
...             self.close()
... 
>>> with open(".../doc.pdf", "rb") as fh, PdfWithCtx(fh) as pdf:
...     print(pdf[0].get_textpage().get_text_bounded())

So I guess autoclose is more or less obsolete. Thinking about it a second time, python should also close the buffer on garbage collection on its own, so supposedly there's no real advantage in binding it into our finalizer.

I think I might remove the parameter with the next major version (if that ever happens 😅).