pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/

After get_text_bounded, cannot MOVE file #317

Closed CaptainPalapa closed 3 months ago

CaptainPalapa commented 3 months ago

Reason for Generic issue (keyword/topic)

File does not get closed

Description

For the "confirm not a build issue", I can't really confirm that. I'm really new to python, maybe I don't understand. Package is from: pip install pypdfium2

Here is the code:

import pypdfium2 as pdfium
def extract_pdf(self, pdf_file):
        pdf = pdfium.PdfDocument(pdf_file)
        n_pages = len(pdf)  # get the number of pages in the document
        text = ""
        for i in range(n_pages):
            page = pdf[i]  # load a page
            textpage = page.get_textpage()
            text = text + textpage.get_text_bounded()

        return text

After this function completes, I attempt to move the file from a /incoming to a /processed folder, but I get:

  File ["...]FileService.py", line 44, in move_file
    os.rename(source, destination)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '/import/incoming/20200407RVWB2851342013171555418724-lme_temp.pdf' -> './import/processed'

I read that the file is automatically closed when the processor gets out of scope, but... not the case?

mara004 commented 3 months ago

> I read that the file is automatically closed when the processor gets out of scope, but... not the case?

The file is automatically closed when the object is garbage collected/finalized. Python, unlike Rust, does not have deterministic memory management. There can be an arbitrary delay from reaching refcount 0 to being collected/finalized.
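As a rough illustration of that delay, here is a standard-library sketch with no pypdfium2 involved. `Handle` is a hypothetical stand-in for any object that owns an OS resource, and a reference cycle is just one way an object can outlive its last visible name in CPython. The garbage collector is switched off briefly only to make the timing reproducible.

```python
import gc
import weakref


class Handle:
    """Hypothetical stand-in for an object that owns an OS resource."""

    def __init__(self, log):
        # The callback runs when this object is finally collected.
        self._finalizer = weakref.finalize(self, log.append, "released")


def demo():
    log = []
    gc.disable()  # keep the cycle collector from running on its own schedule
    try:
        h = Handle(log)
        h.me = h  # reference cycle: refcount cannot drop to 0 by itself
        del h  # no name refers to the object any more
        before = list(log)  # finalizer has not run yet
        gc.collect()  # the cycle collector now finds and finalizes it
        after = list(log)
    finally:
        gc.enable()
    return before, after


if __name__ == "__main__":
    print(demo())
```

Between `del h` and `gc.collect()` the object is unreachable but not yet finalized. A file handle held that way would still be open during that window.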

So, explicitly closing the PdfDocument might fix the issue (try: ... finally: pdf.close()). Also make sure you don't have any other dangling handles to the file besides the PdfDocument.
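A minimal sketch of that suggestion, applied to the function from the question. It uses only the calls already shown in this thread plus `close()` on the page and text page. Dropping `self`, collecting text in a list, and importing inside the function are choices made for this example, not part of the original code.

```python
def extract_pdf(pdf_file):
    """Return the text of all pages, releasing the file before returning."""
    # Imported lazily (a choice for this sketch) so the module itself loads
    # even where pypdfium2 is not installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_file)
    try:
        parts = []
        for i in range(len(pdf)):
            page = pdf[i]  # load a page
            textpage = page.get_textpage()
            parts.append(textpage.get_text_bounded())
            # Close child objects before the parent document.
            textpage.close()
            page.close()
        return "".join(parts)
    finally:
        # Runs on success and on error, so the handle is released either way.
        pdf.close()
```

With the document closed before the function returns, a later `os.rename(source, destination)` should no longer hit WinError 32, assuming nothing else in the process still holds the file open.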

> For the "confirm not a build issue", I can't really confirm that. I'm really new to python, maybe I don't understand. Package is from: pip install pypdfium2

This simply means I intended for you to use the PyPA issue template rather than the generic one. Virtually everyone seems to get this wrong, so I suppose it is just a bit too confusing. The PyPA template merely has a few diagnostic commands to identify the pypdfium2, Python and OS versions used. Anyway, I think that was not relevant here.

CaptainPalapa commented 3 months ago

Thank you, @mara004! This solved my problem exactly. On my five-year-old dev machine, I can grab my first available PDF, extract the text to a new .txt file, and move the PDF to a /processed folder in as little as 12 ms. Woot. Thanks!

I should also let you know that I tried four other PDF libs before I came across yours, but none of those would extract the text in the correct order, for some reason. Thanks for the great work on getting it correct! 😄