Closed: elonzh closed this issue 7 months ago
Yes, it is not allowed to access pdfium functions simultaneously across different threads. If that is not ensured, then arbitrary issues may arise (including security issues). If you are sure the input data is correct (e.g. by testing with a single-threaded script), then this will likely be the cause.
However, if it's 4 concurrent tasks in a thread pool, then I'd expect it to crash almost immediately. Your report suggests to me that it works for longer than that?
Yes, those tasks continuously complain "PDFium: Data format error" and the worker doesn't exit.
@elonzh How were you able to resolve this issue?
Just change the worker pool type to prefork.
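For readers who want to see what that change might look like, here is a minimal sketch of a Celery configuration module. It assumes the settings are loaded with something like app.config_from_object("celeryconfig"); the module name and the concurrency value of 4 (taken from the original report) are illustrative, not something confirmed by the reporter.

```python
# celeryconfig.py (hypothetical module name)
# Sketch only: select Celery's process-based pool instead of the thread pool,
# so that each worker process loads its own copy of pdfium and only ever
# calls it from a single thread.

# Name of the pool implementation used by the worker.
worker_pool = "prefork"

# Number of worker processes; 4 mirrors the concurrency in the report.
worker_concurrency = 4
```

The same choice can usually be made on the command line with the worker's --pool option. Either way, the point is that pdfium calls no longer share a process across threads.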
Thanks.
How can this problem be solved in FastAPI? Does any expert know?
@homeant Locking the resource between operations should be OK.
Excuse me, has this been resolved? I am also using FastAPI with ThreadPoolExecutor, and I have wrapped every pdfium PdfDocument() call in a global lock: with pdf_lock: pdf_doc = pdfium.PdfDocument(pdf_path). But it still reports the same kind of error.
@HaoRenkk123 This is not a pypdfium2 issue, but a caller-side integration issue, and as such cannot be "resolved" in this project. I don't use celery/fastapi personally. Take a look at the hints provided by @elonzh.
Note that only locking PdfDocument construction is not sufficient: you'd have to guard all code that uses pdfium. It's not just PdfDocument; none of (py)pdfium's APIs are thread-compatible.
See also https://pypdfium2.readthedocs.io/en/stable/python_api.html#thread-incompatibility
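To illustrate what guarding all pdfium code could look like, here is a minimal sketch. The names PDFIUM_LOCK, pdfium_locked and extract_text are hypothetical helpers, not part of pypdfium2. The sketch assumes pypdfium2's PdfDocument, get_textpage(), get_text_range() and close() behave as in recent releases. The idea is that the whole open/read/close sequence runs while holding one process-wide lock, rather than only the constructor call.

```python
import functools
import threading

# One process-wide lock shared by every function that touches pdfium.
# A re-entrant lock lets one guarded function call another without deadlock.
PDFIUM_LOCK = threading.RLock()


def pdfium_locked(func):
    """Run the decorated function while holding PDFIUM_LOCK (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with PDFIUM_LOCK:
            return func(*args, **kwargs)
    return wrapper


@pdfium_locked
def extract_text(pdf_path):
    """Open the document, read every page's text, and close it, all under the lock."""
    import pypdfium2 as pdfium  # imported lazily; requires pypdfium2 to be installed

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            texts.append(textpage.get_text_range())
            # Close child objects explicitly so cleanup also happens under the lock.
            textpage.close()
            page.close()
        return texts
    finally:
        pdf.close()
```

A thread pool task would then call extract_text(path) and return only plain Python data, never pdfium objects. This serializes all pdfium work, so it trades parallelism for safety; a process-based pool avoids that trade-off.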
@HaoRenkk123 You can start from here. That worked well with my implementation. I am using it with FastAPI.
Checklist
pypdfium2 from PyPI or GitHub/pypdfium2-team.
Description
I am using pypdfium2 to read PDF metadata and text in Celery workers.
The Celery workers run in thread pool mode with a concurrency of 4.
The tasks start failing randomly at the same time, even though the files are valid PDFs, and they don't seem to recover unless I restart the container.
Is this a threading issue, because PDFium is not thread-safe?
At first I suspected a system load problem (CPU/memory), but after two or three days of observation I don't think it is related to system load.