Closed maciej44 closed 3 years ago
Hi @MackoDS, I appreciate your interest in the library. I am not sure this can be considered a bug. The reason you are able to run text extraction outside the with block is that the for loop goes through all the pages and calls the .extract_text() method on each one, which fills in all the necessary variables in the class instance. As a result, the methods that rely on those variables work properly. In the latter case, when you don't call the method, the variables are not filled in, so when you call the .extract_text() method outside the with block, it fails because it then tries to read the file object, which has already been closed. Similarly, if you do something like
with pdfplumber.open(file_url) as pdf:
    for page in pdf.pages:
        page_text = page.extract_text()
        break  # Break after reading the first page.

print(pdf.pages[0].extract_text())  # This will work as the page has already been read.
print(pdf.pages[1].extract_text())  # This will not work.
the text extraction on page 0 will work, but on page 1 it will not.
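To illustrate the behaviour described above without depending on the library's internals, here is a minimal sketch. LazyPage is a hypothetical stand-in and not pdfplumber's actual page class. It only assumes that a page reads from the file on its first .extract_text() call and caches the result afterwards.

```python
import os
import tempfile


class LazyPage:
    """Hypothetical stand-in for a page object; not pdfplumber's real class."""

    def __init__(self, stream, offset, length):
        self._stream = stream
        self._offset = offset
        self._length = length
        self._cached = None  # filled in on the first extract_text() call

    def extract_text(self):
        if self._cached is None:
            # First call: the data has to come from the underlying file.
            self._stream.seek(self._offset)
            self._cached = self._stream.read(self._length).decode("ascii")
        return self._cached


def demo():
    # A tiny two-"page" file stands in for a real PDF.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"page0page1")
        path = tmp.name
    try:
        with open(path, "rb") as stream:
            pages = [LazyPage(stream, 0, 5), LazyPage(stream, 5, 5)]
            pages[0].extract_text()  # only page 0 is read while the file is open

        first = pages[0].extract_text()  # served from the cache, file not touched
        try:
            pages[1].extract_text()  # has to read from the closed file
            second_error = None
        except ValueError as exc:
            second_error = exc
        return first, second_error
    finally:
        os.remove(path)
```

Calling demo() returns the cached text of the first page together with the ValueError raised for the second one, which mirrors the difference between page 0 and page 1 in the snippet above.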
Thank you @samkit-jain, I appreciate your reply. I understand what you are saying, but doesn't this mean that when we process a very big file, for example, it stays in memory even when we no longer need it?
Good point @MackoDS. In such cases, to free up the memory, you can try using the garbage collector module gc as follows:
import gc
del pdf # This would ensure that the pdf object is cleaned up by the garbage collector.
gc.collect()
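As a rough illustration of why gc.collect() can matter after del, here is a sketch using only the standard library. Document and Page are hypothetical classes, and the back-reference from each page to its document is an assumption made for the example, not a statement about how pdfplumber is implemented. When objects reference each other in a cycle, del only removes the name, and the cycle collector is what actually frees the memory.

```python
import gc
import weakref


class Page:
    """Hypothetical page that keeps a back-reference to its document."""

    def __init__(self, doc):
        self.doc = doc


class Document:
    """Hypothetical document; doc -> pages -> doc forms a reference cycle."""

    def __init__(self, page_count=3):
        self.pages = [Page(self) for _ in range(page_count)]


def release_demo():
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collection from interfering with the demo
    try:
        doc = Document()
        ref = weakref.ref(doc)  # lets us observe the object without keeping it alive

        del doc  # removes the name only; the cycle still holds the object
        alive_after_del = ref() is not None

        gc.collect()  # the cycle collector frees the unreachable cycle
        alive_after_collect = ref() is not None
        return alive_after_del, alive_after_collect
    finally:
        if was_enabled:
            gc.enable()
```

If the object is not part of a cycle, CPython frees it as soon as the last reference disappears, so the explicit gc.collect() is a safety net rather than a requirement.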
Describe the bug
I have a problem closing a file opened with the pdfplumber.open() function. Whenever I call extract_text() on the file object, it seems the file stays open even outside the with pdfplumber.open(): scope, and I am able to call print(pdf.pages[1].extract_text()), for example. Calling close() on the PDF object also doesn't help. Simplified code:
Code to reproduce the problem
Paste it here, or attach a Python file.
Additional context
If I don't call page.extract_text() on the PDF object inside the for loop, the file closes normally and print(pdf.pages[1].extract_text()) throws ValueError: seek of closed file
Environment