jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Functions that can be multi-threaded - Enhancement to documentation #995

Open sandzone opened 9 months ago

sandzone commented 9 months ago

With reference to #91

Is extract_tables the only function with this issue?

I am using multiprocessing with extract_words and haven't run into this issue so far. I wonder whether that is just luck, or whether extract_words doesn't depend on the document-wide ._tokens state that @jsvine mentioned in #91.

It would be very helpful if this aspect were mentioned in the documentation.
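For illustration, a minimal sketch of the multiprocessing pattern described above. The worker is a stand-in: in real use it would open the PDF with pdfplumber inside the worker process and call extract_words there, because Page objects cannot be pickled and sent to workers, so you pass page indices instead.

```python
from multiprocessing import Pool

def process_index(page_number):
    # Stand-in for the real worker. With pdfplumber you would open the
    # PDF here and call pdf.pages[page_number].extract_words(), since
    # Page objects cannot be pickled across process boundaries.
    return {"page": page_number, "words": []}

def extract_all(n_pages):
    # Pool.map distributes the page indices across worker processes
    # and returns the results in page order.
    with Pool() as pool:
        return pool.map(process_index, range(n_pages))

if __name__ == "__main__":
    results = extract_all(4)
```

Passing indices rather than Page objects is what keeps each worker's parsing state independent, which may be why the multiprocessing route avoids the shared-state issue from #91.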

jsvine commented 9 months ago

Interesting. My best guess is "just luck," since they use the same underlying PDF-parsing process.

Pk13055 commented 8 months ago

I was able to use multi-threading without a problem :) You need to use ThreadPoolExecutor instead of the lower-level threading.Thread.
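To illustrate why ThreadPoolExecutor is the more convenient choice: it hands back futures that carry each call's return value, whereas a raw threading.Thread discards the return value of its target. A minimal sketch, with a stand-in worker in place of a per-page pdfplumber call:

```python
from concurrent.futures import ThreadPoolExecutor

def work(n):
    # stand-in for a per-page call such as page.extract_words()
    return n * n

# executor.map submits one call per input and yields the return
# values in input order -- no manual result collection needed
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(work, range(5)))
```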

jsvine commented 7 months ago

Thanks for the note, @Pk13055! Are you able to share some code that demonstrates your approach?

Pk13055 commented 7 months ago

Here's a small example I put together. It may not run right off the bat, but it should give a general idea:

from asyncio import gather, run

import pdfplumber

async def process_page(page):
    # extract_tables() itself is synchronous; asyncio only interleaves
    # the coroutines here, it does not parallelize the parsing
    processed = page.extract_tables()
    # do other stuff with page
    return processed

async def main():
    with pdfplumber.open("test.pdf") as pdf:
        futures = [process_page(page) for page in pdf.pages]
        await gather(*futures)

if __name__ == "__main__":
    run(main())

I found this approach to be much faster than using a ThreadPoolExecutor, but here's an example anyway:

from concurrent.futures import ThreadPoolExecutor, as_completed

import pdfplumber

def process_page(page):
    # must be a plain function: executor.submit() calls it directly,
    # so an async def would just return an unawaited coroutine object
    processed = page.extract_tables()
    # do other stuff with page
    return processed

def main():
    with pdfplumber.open("test.pdf") as pdf:
        futures = []
        with ThreadPoolExecutor() as executor:
            for page in pdf.pages:
                futures.append(executor.submit(process_page, page))

            for res in as_completed(futures):
                processed = res.result()
                # do something with processed

if __name__ == "__main__":
    main()

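One caveat about the asyncio version above: page.extract_tables() is a blocking call, so plain coroutines still run it one after another on the event loop. asyncio.to_thread (Python 3.9+) moves each blocking call onto a worker thread so the calls can actually overlap. A minimal sketch with a stand-in for the per-page call:

```python
from asyncio import gather, run, to_thread

def work(n):
    # stand-in for a blocking per-page call like page.extract_tables()
    return n + 1

async def main():
    # to_thread runs each blocking call in a worker thread; gather
    # preserves the order of its arguments in the result list
    return await gather(*(to_thread(work, n) for n in range(3)))

results = run(main())
```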
jsvine commented 7 months ago

Thanks! @sandzone: Does @Pk13055's approach work for you?