pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Can pymupdf4llm work with multiprocessing? #79

Closed · IronK77 closed this issue 2 months ago

IronK77 commented 2 months ago

Hey guys, I have seen some interesting tricks in PyMuPDF for using multiple CPUs to process a range of pages within a single PDF. My question goes in a different direction.

Say I have quite a lot of PDFs (around 1000) to convert into Markdown, one document each. Are there any methods to do this more efficiently? I have tried both a multiprocessing pool and a plain for loop: the for loop runs error-free, but the pool version fails in about 90% of cases with a ValueError. The basic structure is shown below. I have a list of file paths and need to convert each file to an .md file.

```python
import gc
import pathlib
import re

import pandas as pd
import pymupdf4llm
from multiprocessing.dummy import Pool
from tqdm import tqdm


def md_convert_esg(record_number):
    path = data_new.report_url[record_number]
    file_name = path.split("/")[-1]
    path_full = "Fake Path" + file_name
    try:
        md_text = pymupdf4llm.to_markdown(path_full, write_images=False)
        # re.sub (the original snippet used an undefined `regex` name);
        # replace the trailing "pdf" with "md" to build the output name
        path_o = "Fake path 1" + re.sub(r"pdf$", "md", file_name)
        pathlib.Path(path_o).write_bytes(md_text.encode())
        return ""
    except Exception as e:
        return e
```

For loop:

```python
temp = []
for i in tqdm(range(len(data_new))):
    temp.append(md_convert_esg(i))
gc.collect()
```

Multiprocessing:

```python
with Pool(4) as p:
    temp = list(tqdm(p.imap(md_convert_esg, range(len(data_new))), total=len(data_new)))
gc.collect()
```

JorjMcKie commented 2 months ago

No, there is currently no support for that. To process multiple files, simply start separate scripts, one per file.
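The one-process-per-file advice can be sketched with the standard library's `ProcessPoolExecutor`, which spawns separate worker processes rather than threads. Note that `multiprocessing.dummy.Pool` in the snippet above is actually a *thread* pool, and PyMuPDF is not thread-safe, which may explain the ValueError; real processes share no interpreter state. `convert_one` below is a hypothetical stand-in for the actual `pymupdf4llm.to_markdown` call, so the sketch stays self-contained.

```python
from concurrent.futures import ProcessPoolExecutor


def convert_one(path):
    # Stand-in for the real work: in practice this would call
    # pymupdf4llm.to_markdown(path) and write the result to an .md file.
    # Each worker runs in its own process, so no PyMuPDF objects are
    # ever shared between workers.
    return f"converted {path}"


if __name__ == "__main__":
    pdfs = ["report1.pdf", "report2.pdf", "report3.pdf"]
    # One task per file; each worker process imports its own modules.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(convert_one, pdfs))
    print(results)
```

Because the worker function must be picklable, it has to live at module top level; any per-file state (paths, options) is passed as arguments rather than captured from globals.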

IronK77 commented 2 months ago

Okay thanks, I will try other possibilities.