jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Cannot close pdf file after calling page.extract_text() on it #446

Closed: maciej44 closed this issue 3 years ago

maciej44 commented 3 years ago

Describe the bug

I have a problem closing a file opened with the pdfplumber.open() function. Whenever I call extract_text() on a page, the file seems to stay open even outside the with pdfplumber.open(): block, and I am still able to call print(pdf.pages[1].extract_text()), for example. Calling close() on the PDF object doesn't help either.
Simplified code:

Code to reproduce the problem


import pdfplumber
import os

class Class():

    def method(self, file_url):

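        with pdfplumber.open(file_url) as pdf:
            for page in pdf.pages:
                # logic in here
                page_text = page.extract_text()

        # The with block has ended, yet this call still succeeds.
        print(pdf.pages[1].extract_text())
        pdf.flush_cache()
        pdf.close()
        # Still succeeds even after an explicit close().
        print(pdf.pages[1].extract_text())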

def main():
    dir_path = os.path.dirname(os.path.realpath(__file__))
    file_path = dir_path + '/file.pdf'

    c = Class()
    c.method(file_path)

if __name__=="__main__":
    main()

Additional context

If I don't call page.extract_text() inside the for loop, like this:

        with pdfplumber.open(file_url) as pdf:
            for page in pdf.pages:
                #logic in here
                pass

the file closes normally and print(pdf.pages[1].extract_text()) raises ValueError: seek of closed file.

Environment

samkit-jain commented 3 years ago

Hi @MackoDS, I appreciate your interest in the library. I am not sure this can be considered a bug. You are able to run text extraction outside the with block because the for loop goes through every page and calls the .extract_text() method on it, which fills in all the necessary variables on each page instance. Methods that rely on those variables then keep working without touching the file. In the latter case, when you don't call the method, the variables are never filled in, so calling .extract_text() outside the with block fails: it tries to read the file object, which has already been closed. Similarly, if you do something like

with pdfplumber.open(file_url) as pdf:
    for page in pdf.pages:
        page_text = page.extract_text()
        break  # Break after reading the first page.

print(pdf.pages[0].extract_text())  # This will work as the page has already been read.
print(pdf.pages[1].extract_text())  # This will not work.

text extraction will work on page 0 but not on page 1.
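The mechanism described above can be reproduced without pdfplumber. The following is only a sketch of the general lazy-caching pattern: LazyPage, its constructor arguments, and the in-memory stream are hypothetical stand-ins, not pdfplumber's actual classes, and the real library caches parsed page data rather than a plain string.

```python
import io


class LazyPage:
    """Hypothetical stand-in for a page object; not pdfplumber's actual class."""

    def __init__(self, stream, offset, length):
        self._stream = stream
        self._offset = offset
        self._length = length
        self._cached_text = None  # Filled in on the first extract_text() call.

    def extract_text(self):
        if self._cached_text is None:
            # First call: the data has to be read from the underlying stream.
            self._stream.seek(self._offset)
            self._cached_text = self._stream.read(self._length).decode("ascii")
        return self._cached_text


stream = io.BytesIO(b"page0page1")
pages = [LazyPage(stream, 0, 5), LazyPage(stream, 5, 5)]

first_text = pages[0].extract_text()  # Reads the stream and caches the result.
stream.close()

cached_text = pages[0].extract_text()  # Served from the cache; no stream access.

try:
    pages[1].extract_text()  # Never read before, so it touches the closed stream.
    second_page_failed = False
except ValueError:
    second_page_failed = True
```

Here the first page keeps working after the stream is closed because nothing needs to be read again, while the second page raises ValueError, which mirrors the behavior reported in this issue.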

maciej44 commented 3 years ago

Thank you @samkit-jain, I appreciate your reply. I understand what you are saying, but doesn't that mean that when we process a very big file, for example, it stays in memory even after we no longer need it?

samkit-jain commented 3 years ago

Good point, @MackoDS. In such cases, to free up the memory, you can try using the garbage collector module gc:

import gc

del pdf  # Drop the reference so the pdf object becomes eligible for garbage collection.
gc.collect()  # Force a collection pass so unreachable objects are freed now.
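To see why del on its own may not be enough, here is a small sketch using only the standard library. Document and Page are hypothetical classes, and the assumption that the objects reference each other in a cycle is made for illustration only; it is not a statement about pdfplumber's internals.

```python
import gc
import weakref


class Page:
    """Hypothetical page that keeps a back-reference to its document."""

    def __init__(self, document, text):
        self.document = document  # Back-reference creates a reference cycle.
        self.cached_text = text


class Document:
    """Hypothetical document holding cached pages; not pdfplumber's PDF class."""

    def __init__(self, texts):
        self.pages = [Page(self, text) for text in texts]


doc = Document(["first page", "second page"])
doc_ref = weakref.ref(doc)  # Lets us observe whether the object is still alive.

del doc       # Removes the name; a reference cycle can keep the object alive.
gc.collect()  # Collects unreachable reference cycles.

released = doc_ref() is None
```

Note that any other live reference, such as a leftover page variable from the for loop, would keep the cached data alive regardless of gc.collect(), so those names need to be deleted or go out of scope as well.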