After get_text_bounded, cannot MOVE file

CaptainPalapa commented 3 months ago

Checklist

[X] I confirm this is not a question or feature request. Otherwise, use the Discussions page.
[X] I confirm this is not an issue encountered with an installed build of pypdfium2, but about some other aspect of the project (specify below). Otherwise, use one of the package templates (PyPA/conda), even if you believe this is not a package-specific issue.
[X] I confirm this is not about an unofficial build of pypdfium2. We do not support third-party builds, and they are not eligible for a bug report.

Reason for Generic issue (keyword/topic)

File does not get closed

Description

For the "confirm not a build issue", I can't really confirm that. I'm really new to python, maybe I don't understand. Package is from: pip install pypdfium2

Here is the code:

import pypdfium2 as pdfium

def extract_pdf(self, pdf_file):
    pdf = pdfium.PdfDocument(pdf_file)
    n_pages = len(pdf)  # get the number of pages in the document
    text = ""
    for i in range(n_pages):
        page = pdf[i]  # load a page
        textpage = page.get_textpage()
        text = text + textpage.get_text_bounded()

    return text

After this function completes, I attempt to move the file from a /incoming to a /processed folder, but I get:

  File ["...]FileService.py", line 44, in move_file
    os.rename(source, destination)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '/import/incoming/20200407RVWB2851342013171555418724-lme_temp.pdf' -> './import/processed'

I read that the file is automatically closed when the processor gets out of scope, but... not the case?

mara004 commented 3 months ago

> I read that the file is automatically closed when the processor gets out of scope, but... not the case?

The file is automatically closed when the object is garbage collected/finalized. Python, unlike Rust, does not have deterministic memory management. There can be an arbitrary delay from reaching refcount 0 to being collected/finalized.
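One way such a delay can arise in CPython is sketched below using only the standard library. `Handle` is a hypothetical stand-in for an object that owns an OS resource, not a pypdfium2 class, and the reference cycle is just one assumed cause of delayed finalization. The finalizer registered with `weakref.finalize` does not run when the last name is deleted, only once the cyclic garbage collector gets to the object.

```python
import gc
import weakref


class Handle:
    """Hypothetical stand-in for an object that owns an OS resource."""


def demo():
    closed = []
    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running in the middle of the demo
    try:
        h = Handle()
        h.self_ref = h  # reference cycle: deleting the name does not free the object
        weakref.finalize(h, closed.append, "finalized")

        del h
        after_del = list(closed)  # finalizer has not run yet

        gc.collect()  # the cyclic collector finds the cycle and finalizes it
        after_collect = list(closed)
    finally:
        if was_enabled:
            gc.enable()
    return after_del, after_collect
```

Calling `demo()` returns the finalizer log right after `del` and again after `gc.collect()`, which shows why relying on finalization to release a file is fragile.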

So, explicitly closing the PdfDocument might fix the issue (try: ... finally: pdf.close()). Also make sure you don't have any other dangling handles to the file besides the PdfDocument.
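A minimal sketch of that suggestion, applied to the function from the original post, might look like the following. The names `extract_pdf_text`, `extract_and_move` and the `open_document` parameter are hypothetical and exist only so the sketch can be tried with a stand-in opener. By default it uses `pdfium.PdfDocument`. Closing the text page and page explicitly before the document is a cautious assumption, since the required order may depend on the pypdfium2 version.

```python
import shutil


def extract_pdf_text(pdf_path, open_document=None):
    """Return the text of all pages and close the document before returning."""
    if open_document is None:
        # Imported lazily so the sketch can be exercised with a stand-in opener.
        import pypdfium2 as pdfium
        open_document = pdfium.PdfDocument

    pdf = open_document(pdf_path)
    try:
        parts = []
        for i in range(len(pdf)):
            page = pdf[i]  # load a page
            textpage = page.get_textpage()
            try:
                parts.append(textpage.get_text_bounded())
            finally:
                # Close children first, then the page that owns them.
                textpage.close()
                page.close()
        return "".join(parts)
    finally:
        # Runs even if extraction raises, so the file handle is released
        # before the caller tries to move or rename the file.
        pdf.close()


def extract_and_move(pdf_path, processed_dir, open_document=None):
    """Extract the text, then move the PDF into an existing directory."""
    text = extract_pdf_text(pdf_path, open_document)
    shutil.move(pdf_path, processed_dir)  # the document is already closed here
    return text
```

Collecting the page texts in a list and joining once also avoids repeated string concatenation, though that is unrelated to the file-locking problem.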

> For the "confirm not a build issue", I can't really confirm that. I'm really new to python, maybe I don't understand. Package is from: pip install pypdfium2

This simply means I would have intended you to use the PyPA issue template, not the generic one. Virtually everyone seems to do this wrong, so I suppose it is just a bit too confusing. The PyPA template merely has a few diagnostic commands to identify the pypdfium2, python and OS versions used. Anyway, I think that was not relevant here.

CaptainPalapa commented 3 months ago

Thank you @mara004! This solved my problem exactly. On my five-year-old dev machine, I can grab my first available PDF, extract the text to a new .txt file and move the PDF to a /processed folder in as little as 12 ms. Woot. Thanks!

I should also let you know that I tried four other PDF libs before I came across yours, but none of those would extract the text in the correct order, for some reason. Thanks for the great work on getting it correct! 😄

pypdfium2-team / pypdfium2