opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
13.43k stars 1.01k forks

The model repeatedly initializes when processing multiple PDFs in a single process, and it does not implement a singleton pattern. #502

Closed drunkpig closed 1 month ago

drunkpig commented 2 months ago

Description of the bug

When multiple PDFs are processed in a single process, the model is initialized repeatedly instead of once; model loading does not follow a singleton pattern.
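For readers unfamiliar with the pattern being asked for, here is a minimal sketch of a per-process singleton accessor. It is not MinerU's code: `DummyModel` and `get_model` are hypothetical names standing in for an expensive model and its loader, and the sketch only assumes that model construction is the costly step.

```python
from functools import lru_cache


class DummyModel:
    """Hypothetical stand-in for an expensive-to-load model (not a MinerU class)."""

    init_count = 0  # counts how many times the constructor has run

    def __init__(self, device: str) -> None:
        DummyModel.init_count += 1
        self.device = device

    def analyze(self, pdf_bytes: bytes) -> dict:
        # Placeholder for real inference; just reports the input size.
        return {"device": self.device, "num_bytes": len(pdf_bytes)}


@lru_cache(maxsize=None)
def get_model(device: str) -> DummyModel:
    """Return one shared model per device, constructing it only on the first call."""
    return DummyModel(device)


def process_all(pdfs: list) -> list:
    # Every iteration asks for the model, but only the first call builds it.
    return [get_model("cpu").analyze(pdf) for pdf in pdfs]
```

With this shape, a loop over many PDFs pays the initialization cost once per process and device, because `functools.lru_cache` returns the cached instance on later calls with the same argument.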


How to reproduce the bug

for ii, pdf_info in enumerate(all_input_jsonl_lines):  # the slice of input assigned to this GPU
    track_id = pdf_info['track_id']
    # Temporary local JSON file holding the result for one book
    temp_json_save_file = os.path.join(temp_json_save_path, f"{track_id}.json")
    # Skip this PDF if its result already exists locally
    if os.path.exists(temp_json_save_file):
        logger.info(f"{temp_json_save_file} already exists, skip.")
        continue

    s3_pdf_path = pdf_info['path']
    s3_pdf_client = get_s3_cli_from_pool(s3_pdf_path)

    # Read the PDF file into memory
    pdf_bytes = get_pdf_bytes(s3_pdf_path, s3_pdf_client)
    # A new UNIPipe is constructed for every PDF in the loop
    magicpdf = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer=None)
    # Use fitz to get the page count
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page_count = doc.page_count
    doc.close()
    extract_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    try:
        magicpdf.pipe_classify()
        magicpdf.pipe_analyze()
        doc_layout_result = magicpdf.model_list
        pdf_info["doc_layout_result"] = doc_layout_result
    except Exception as e:
        logger.exception(e)
        err_info = str(e)
        __set_extra_info(pdf_info, "__error", err_info)

    __set_extra_info(pdf_info, "__inference_datetime", extract_time)
    __set_extra_info(pdf_info, "__mineru_inference_version", magic_pdf_version.__version__)

    # outputs.append(pdf_info)
    logger.info(f"processed {ii}/{total_pdfs} pdfs")

    ###################################################
    ## Save this PDF's result to a local file; once the whole JSON input has been
    ## processed on every GPU, upload everything to Ceph in one go
    ###################################################

    # ensure_ascii=False emits non-ASCII text, so write the file as UTF-8 explicitly
    with open(temp_json_save_file, 'w', encoding='utf-8') as ff:
        ff.write(json.dumps(pdf_info, ensure_ascii=False))
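The loop above resumes work by skipping any PDF whose local JSON file already exists. A rough sketch of that skip-and-checkpoint structure follows, with one assumed refinement that the original code does not have: each file is written to a temporary path and then renamed, so an interrupted run cannot leave a partial file that a later run would skip. `save_json_atomically`, `process_records`, and the `analyze` callback are hypothetical names, and the error field is simplified compared with `__set_extra_info`.

```python
import json
import os
import tempfile


def save_json_atomically(path: str, record: dict) -> None:
    """Write record as JSON to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False))
        os.replace(tmp_path, path)  # replaces the destination in a single step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def process_records(records: list, out_dir: str, analyze) -> list:
    """Process each record once; records with an existing output file are skipped."""
    processed = []
    for record in records:
        out_file = os.path.join(out_dir, f"{record['track_id']}.json")
        if os.path.exists(out_file):
            continue  # finished in an earlier run
        record = dict(record)  # avoid mutating the caller's data
        try:
            record["doc_layout_result"] = analyze(record)
        except Exception as e:
            record["__error"] = str(e)  # simplified error bookkeeping
        save_json_atomically(out_file, record)
        processed.append(record["track_id"])
    return processed
```

Running `process_records` twice over the same inputs and output directory does the work only on the first pass; the second pass finds the files and returns an empty list.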

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

drunkpig commented 1 month ago

not a bug