Open sandzone opened 9 months ago
Interesting. My best guess is "just luck," since they use the same underlying PDF-parsing process.
I was able to use multi-threading with no problem :) You need to use a `ThreadPoolExecutor` instead of the lower-level `threading.Thread`.
Thanks for the note, @Pk13055! Are you able to share some code that demonstrates your approach?
Here's a small example I put together. It may not run right off the bat, but it should give the general idea:

```python
from asyncio import gather, run

import pdfplumber


async def process_page(page):
    processed = page.extract_tables()
    # do other stuff with page
    return processed


async def main():
    with pdfplumber.open("test.pdf") as pdf:
        # schedule one coroutine per page and wait for all of them
        return await gather(*(process_page(page) for page in pdf.pages))


if __name__ == "__main__":
    run(main())
```
I found this approach to be much faster than using a `ThreadPoolExecutor`, but here's an example of that anyway:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import pdfplumber


def process_page(page):
    processed = page.extract_tables()
    # do other stuff with page
    return processed


def main():
    with pdfplumber.open("test.pdf") as pdf:
        with ThreadPoolExecutor() as executor:
            futures = [executor.submit(process_page, page) for page in pdf.pages]
            for future in as_completed(futures):
                processed = future.result()
                # do something with processed


if __name__ == "__main__":
    main()
```
Thanks! @sandzone: Does @Pk13055's approach work for you?
With reference to #91: Is `extract_tables` the only function with this issue? I am using multiprocessing with `extract_words` and haven't faced this issue so far. I wonder if this is just luck, or if `extract_words` doesn't depend on the document-wide `._tokens` issue that @jsvine mentioned in #91.

It would be very helpful if this aspect were mentioned in the documentation.