pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/
Weird "PDFium: Data format error" when using pypdfium2 in Celery task. #309

Closed elonzh closed 7 months ago

elonzh commented 7 months ago

Description

I am using pypdfium2 to read PDF metadata and text in Celery workers.

The Celery workers run in thread pool mode with a concurrency of 4.

The tasks start failing at random, all at the same time, even though the files are valid PDFs, and they do not seem to recover unless I restart the container.

Could this be a threading issue, since PDFium is not thread-safe?

At first I suspected a system load problem (CPU/memory), but after two or three days of observation I do not think it is related to system load.

Install Info

pypdfium2 4.27.0
pdfium 123.0.6281.0
Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

mara004 commented 7 months ago

Yes, it is not allowed to access pdfium functions simultaneously across different threads. If that is not ensured, then arbitrary issues may arise (including security issues). If you are sure the input data is correct (e.g. by testing with a single-threaded script), then this will likely be the cause.

mara004 commented 7 months ago

However, with 4 concurrent tasks in a thread pool, I'd expect it to crash almost immediately. Your report suggests it seemed to work for longer than that?

elonzh commented 7 months ago

Yes, the tasks keep reporting "PDFium: Data format error" and the worker doesn't exit.

edWin-m commented 5 months ago

@elonzh How were you able to resolve this issue?

elonzh commented 5 months ago

> @elonzh How were you able to resolve this issue?

Just change the worker pool type to prefork.
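
For readers who want to try this, here is a minimal sketch of the change as a Celery configuration module. `worker_pool` and `worker_concurrency` are Celery's documented setting names; the module name `celeryconfig.py` and the concurrency value of 4 (taken from the report above) are assumptions, not elonzh's actual configuration.

```python
# celeryconfig.py (hypothetical module name)

# Run tasks in child processes instead of threads. Each prefork worker
# process executes one task at a time, so pdfium is never entered from
# two threads at once inside the same process.
worker_pool = "prefork"

# Number of worker processes; 4 mirrors the concurrency in the report.
worker_concurrency = 4
```

Such a module can be loaded with `app.config_from_object("celeryconfig")`. The same effect should be available from the command line by passing `--pool=prefork --concurrency=4` to `celery worker`.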

edWin-m commented 5 months ago

> @elonzh How were you able to resolve this issue?
>
> Just change the worker pool type to prefork.

Thanks.

homeant commented 3 months ago

Does anyone know how to solve this problem in FastAPI?

elonzh commented 3 months ago

@homeant Locking the resource around the operations should be OK.
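
In an async application such as one built on FastAPI, a related option (not proposed in this thread, so treat it as an assumption) is to send every pdfium call to one dedicated worker thread, which serializes the calls without explicit locks. The sketch below uses only the standard library for the scheduling; `PDFIUM_EXECUTOR`, `run_pdfium` and `count_pages` are hypothetical names, and it only helps if no other code in the process calls pdfium directly.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# A single worker thread: all pdfium work submitted here runs one call
# at a time, always on the same thread.
PDFIUM_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="pdfium")


def count_pages(path):
    """Example pdfium job: open the document, read the page count, close it."""
    import pypdfium2 as pdfium  # imported lazily so the sketch loads without it

    pdf = pdfium.PdfDocument(path)
    try:
        return len(pdf)
    finally:
        pdf.close()


async def run_pdfium(func, *args):
    """Await a blocking pdfium function on the dedicated pdfium thread."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(PDFIUM_EXECUTOR, func, *args)
```

An async request handler would then call something like `pages = await run_pdfium(count_pages, pdf_path)`. The job should return plain Python values rather than pdfium objects, so that nothing pdfium-related is touched from another thread afterwards.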

HaoRenkk123 commented 2 months ago

> How to solve this problem in FastAPI, does any expert know?

Excuse me, has this been resolved? I am also using FastAPI with ThreadPoolExecutor, and I have wrapped every pdfium PdfDocument() call in a global lock:

    with pdf_lock:
        pdf_doc = pdfium.PdfDocument(pdf_path)

But it still reports the same kind of error.

mara004 commented 2 months ago

@HaoRenkk123 This is not a pypdfium2 issue, but a caller-side integration issue, and as such cannot be "resolved" in this project. I don't use celery/fastapi personally. Take a look at the hints provided by @elonzh. Note that only locking PdfDocument construction is not sufficient: you'd have to guard all code that uses pdfium. It's not just PdfDocument – none of (py)pdfium's APIs are thread-compatible. See also https://pypdfium2.readthedocs.io/en/stable/python_api.html#thread-incompatibility
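
To illustrate "guard all code that uses pdfium", here is a minimal sketch in which the whole open, read and close sequence runs under one process-wide lock, rather than only the `PdfDocument` constructor. The names `PDFIUM_LOCK`, `uses_pdfium` and `read_pdf` are hypothetical; the pypdfium2 calls (`PdfDocument`, `get_metadata_dict`, `get_textpage`, `get_text_range`, `close`) are assumed to behave as in the pypdfium2 4.x helper API.

```python
import functools
import threading

# One lock shared by every function in the process that touches pdfium.
PDFIUM_LOCK = threading.RLock()


def uses_pdfium(func):
    """Run func while holding PDFIUM_LOCK so pdfium calls never overlap."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with PDFIUM_LOCK:
            return func(*args, **kwargs)
    return wrapper


@uses_pdfium
def read_pdf(path):
    """Return (metadata, page texts); every pdfium call happens under the lock."""
    import pypdfium2 as pdfium  # imported lazily so the sketch loads without it

    pdf = pdfium.PdfDocument(path)
    try:
        metadata = pdf.get_metadata_dict()
        texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            texts.append(textpage.get_text_range())
            # Close child objects before the document, still under the lock.
            textpage.close()
            page.close()
        return metadata, texts
    finally:
        pdf.close()
```

The function returns plain Python values instead of pdfium objects, so callers have no reason to touch pdfium outside the lock. Any other function that uses pdfium would need the same decorator for the guarantee to hold.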

edWin-m commented 2 months ago

> @elonzh How were you able to resolve this issue?
>
> Just change the worker pool type to prefork.

@HaoRenkk123 You can start from here. That worked well with my implementation. I am using it with FastAPI.