pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/

PdfDocument.get_page is non-thread-safe #303

Closed kangtsang closed 6 months ago

kangtsang commented 6 months ago

Reason for Generic issue (keyword/topic): an exception occurs when multiple threads are used to read PDFs

Description

poetry show pypdfium2
Using python3 (3.10.0)
name : pypdfium2
version : 4.28.0

Script to reproduce issue

import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pypdfium2 as pdfium
import logging

lock = threading.Lock()

def thread_worker_callback(worker):
    exception = worker.exception()
    if exception:
        logging.error('task exception: {}'.format(exception))
        logging.exception(exception)

indexThreadPool = ThreadPoolExecutor(2, 'loader-thread-pool')

def pypdfium_test():
    print('test start')
    path = Path('./docs')
    pdf_files = list(path.rglob("**/*.pdf"))
    for pdf_file in pdf_files:
        worker = indexThreadPool.submit(unsafe_load_pdf, pdf_file)
        worker.add_done_callback(thread_worker_callback)

def unsafe_load_pdf(file):
    pdf_reader = pdfium.PdfDocument(file, autoclose=True)
    try:
        for page_number, page in enumerate(pdf_reader):
            text_page = page.get_textpage()
            content = text_page.get_text_range()
            text_page.close()
            page.close()
    finally:
        pdf_reader.close()
    print(f'pdf load finish {file}')

if __name__ == '__main__':
    pypdfium_test()

ERROR:root:task exception: exception: access violation reading 0x0000028234E7E280
ERROR:root:exception: access violation reading 0x0000028234E7E280
Traceback (most recent call last):
  File "C:\Users\dell\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "E:\git\pubilc\pypdfium\main.py", line 33, in unsafe_load_pdf
    for page_number, page in enumerate(pdf_reader):
  File "E:\git\pubilc\pypdfium\venv\lib\site-packages\pypdfium2\_helpers\document.py", line 118, in __iter__
    yield self[i]
  File "E:\git\pubilc\pypdfium\venv\lib\site-packages\pypdfium2\_helpers\document.py", line 121, in __getitem__
    return self.get_page(i)
  File "E:\git\pubilc\pypdfium\venv\lib\site-packages\pypdfium2\_helpers\document.py", line 368, in get_page
    raw_page = pdfium_c.FPDF_LoadPage(self, index)
OSError: exception: access violation reading 0x0000028234E7E280

Install Info

Name: pypdfium2
Version: 4.28.0
Summary: Python bindings to PDFium
Home-page: https://github.com/pypdfium2-team/pypdfium2
Author: pypdfium2-team
Author-email: geisserml@gmail.com

mara004 commented 6 months ago

This is expected: pdfium, and therefore also pypdfium2, is not thread-compatible:
https://pypdfium2.readthedocs.io/en/stable/python_api.html#thread-incompatibility
https://pdfium.googlesource.com/pdfium/+/b8ea87677cb882613f37094fe681876e9eaa3e16/public/fpdfview.h#11

It is not allowed to call pdfium functions simultaneously from different threads, not even on different documents.

However, you may still use pdfium in a threaded context as long as only a single pdfium call can be in progress at any time (e.g. do all pdfium work in one thread and other work in other threads, or use a mutex to serialize pdfium calls across threads).
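A minimal sketch of the first option (all pdfium work in one thread), assuming a setup like the one in the report. The names `pdfium_executor`, `post_executor`, `extract_text` and `process` are hypothetical and not part of pypdfium2. Because the pdfium pool has exactly one worker, no two pdfium calls can overlap, while any non-pdfium post-processing can still run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical executors for this sketch: a single worker means two pdfium
# calls can never overlap, while post-processing may run in parallel.
pdfium_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="pdfium")
post_executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="post")


def extract_text(path):
    """Read the text of every page. Must only run on pdfium_executor."""
    import pypdfium2 as pdfium  # lazy import: only needed when actually called

    pdf = pdfium.PdfDocument(path)
    texts = []
    try:
        for page in pdf:
            try:
                textpage = page.get_textpage()
                try:
                    texts.append(textpage.get_text_range())
                finally:
                    textpage.close()
            finally:
                page.close()
    finally:
        pdf.close()
    return texts


def process(paths, extract=extract_text, post=len):
    """Extract on the single pdfium thread, post-process on the other pool."""
    extracted = [pdfium_executor.submit(extract, p) for p in paths]
    posted = [post_executor.submit(post, f.result()) for f in extracted]
    return [f.result() for f in posted]
```

Here `post=len` is only a placeholder for real work such as indexing the extracted text. Anything that touches pdfium objects belongs in `extract`, not in `post`.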

I decided not to wrap pdfium functions in a global mutex because it's not clear to me what impact that might have on performance (consider frequently-called APIs such as the FPDFText_GetChar*() family, where there are already concerns with FFI overhead). Finer-grained locking as needed on the caller side seemed more elegant.
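Caller-side locking could look roughly like the sketch below. `PDFIUM_LOCK`, `pdfium_locked` and `load_pdf_text` are hypothetical names chosen for illustration. The lock here is held for a whole document, which is coarse but simple. An application could instead guard smaller units of work, as long as every pdfium call in the process goes through the same lock.

```python
import functools
import threading

# Hypothetical application-wide lock; reentrant so guarded functions can nest.
PDFIUM_LOCK = threading.RLock()


def pdfium_locked(func):
    """Decorator: run func while holding PDFIUM_LOCK."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with PDFIUM_LOCK:
            return func(*args, **kwargs)
    return wrapper


@pdfium_locked
def load_pdf_text(path):
    """Return the text of each page; every pdfium call happens under the lock."""
    import pypdfium2 as pdfium  # lazy import: only needed when actually called

    pdf = pdfium.PdfDocument(path)
    texts = []
    try:
        for page in pdf:
            try:
                textpage = page.get_textpage()
                try:
                    texts.append(textpage.get_text_range())
                finally:
                    textpage.close()
            finally:
                page.close()
    finally:
        pdf.close()
    return texts
```

With this in place, a function like `load_pdf_text` can be submitted to a multi-worker `ThreadPoolExecutor` as in the reproduction script, and the pdfium sections run one at a time. Closing objects explicitly inside the guarded function keeps the cleanup calls under the lock as well.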