Hey guys, I've seen some interesting tricks in PyMuPDF for using multiple CPUs to process a range of pages within a single PDF. My question goes in a different direction.
Say I have quite a lot of PDFs (about 1000) to convert into Markdown documents, one per PDF. Is there any way to do this more efficiently? I have tried multiprocessing.Pool and a plain for loop: the for loop runs without errors, but the multiprocessing version fails in about 90% of cases (I get a ValueError). The basic structure is shown below. I have a list of file paths, and I need to convert each file into an .md file.
import pymupdf4llm
import pathlib
import pandas as pd
from multiprocessing.dummy import Pool
from tqdm import tqdm
import gc
def md_convert_esg(record_number):
    path = data_new.report_url[record_number]
    file_name = path.split("/")[-1]
    path_full = "Fake Path" + file_name
    try:
        md_text = pymupdf4llm.to_markdown(path_full, write_images=False)
        # was regex.sub("pdf", "md", file_name), but regex was never imported
        path_o = "Fake path 1" + file_name.replace(".pdf", ".md")
        pathlib.Path(path_o).write_bytes(md_text.encode())
        return ""
    except Exception as e:
        return e
For loop:

temp = []
for i in tqdm(range(len(data_new))):
    temp.append(md_convert_esg(i))
gc.collect()
Multiprocessing:

with Pool(4) as p:
    temp = list(tqdm(p.imap(md_convert_esg, range(len(data_new))), total=len(data_new)))
gc.collect()
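I can't reproduce the ValueError without the data, but one common source of trouble with process pools is workers reaching into module-level state (here, the data_new DataFrame) instead of receiving plain, picklable arguments. A minimal sketch of that alternative, where each worker gets only two path strings (convert_one, convert_all, and the output layout are my own names, not from the script above):

```python
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp
from pathlib import Path


def convert_one(job):
    """Worker: convert one PDF to Markdown. Returns None on success, an error string on failure."""
    src, dst = job
    try:
        import pymupdf4llm  # imported inside the worker, so each process loads its own copy
        md_text = pymupdf4llm.to_markdown(src, write_images=False)
        Path(dst).write_bytes(md_text.encode())
        return None
    except Exception as e:
        return f"{src}: {e}"


def convert_all(pdf_paths, out_dir, workers=4):
    """Fan (src, dst) path pairs out to a process pool and collect any errors."""
    jobs = [(p, str(Path(out_dir) / (Path(p).stem + ".md"))) for p in pdf_paths]
    # "fork" start method (POSIX only): workers inherit the parent cheaply.
    # On Windows, drop mp_context and guard the call site with
    # `if __name__ == "__main__":` instead.
    ctx = mp.get_context("fork")
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        return [err for err in pool.map(convert_one, jobs) if err is not None]
```

Since nothing but two strings crosses the process boundary, there is nothing for pickling to trip over, and a tqdm wrapper can go around pool.map the same way as in the snippet above.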