opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, web-page, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0

AssertionError: Dataset 'scihub_train' is already registered! #184

Closed yuanyehome closed 1 month ago

yuanyehome commented 1 month ago

Description of the bug | 错误描述

I wrapped the processing logic in the following function:

import copy
import json
import os
import os.path as osp

from loguru import logger

import magic_pdf.model as model_config
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter


def process_one_file(file_path: str, output_path: str):
    os.makedirs(output_path, exist_ok=True)
    try:
        with open(file_path, "rb") as pdf_file:
            pdf_bytes = pdf_file.read()
        model_json = []

        jso_useful_key = {"_pdf_type": "", "model_list": model_json}
        local_image_dir = os.path.join(output_path, "images")
        image_dir = str(os.path.basename(local_image_dir))
        image_writer = DiskReaderWriter(local_image_dir)
        pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True)
        pipe.pipe_classify()
        # If no valid model data was passed in, use the built-in model for analysis.
        if len(model_json) == 0:
            if model_config.__use_inside_model__:
                pipe.pipe_analyze()
            else:
                logger.error("need model list input")
                raise ValueError("need model list input")
        pipe.pipe_parse()
        md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
        with open(
            f"{output_path}/{osp.basename(output_path)}.md", "w", encoding="utf-8"
        ) as f:
            f.write(md_content)
        orig_model_list = copy.deepcopy(pipe.model_list)
        with open(f"{output_path}/model_list.json", "w") as f:
            json.dump(orig_model_list, f, ensure_ascii=False, indent=4)
    except Exception as e:
        logger.exception(e)
        exit(-1)

However, on the second call to this function, detectron2 raises: AssertionError: Dataset 'scihub_train' is already registered!

The full traceback:

Traceback (most recent call last):

  File "/home/yuanye/pdf-extract/run-test.py", line 80, in <module>
    main()
    └ <function main at 0x7f754a81e950>

  File "/home/yuanye/pdf-extract/run-test.py", line 69, in main
    process_one_file(file_path, output_path)
    │                │          └ '/home/yuanye/pdf-extract/example-outputs/example-simple-text-pdf'
    │                └ '/home/yuanye/pdf-extract/pdf-examples/example-simple-text.pdf'
    └ <function process_one_file at 0x7f754ce01990>

> File "/home/yuanye/pdf-extract/run-test.py", line 32, in process_one_file
    pipe.pipe_analyze()
    │    └ <function UNIPipe.pipe_analyze at 0x7f73fca40ee0>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f7307c00220>

  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 29, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=False)
    │    │            │           │    └ b'%PDF-1.4\n%\xaa\xab\xac\xad\n1 0 obj\n<<\n/Title (Workload Pipelining)\n/Author (Arm Ltd.)\n/Subject (In this guide, read a...
    │    │            │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f7307c00220>
    │    │            └ <function doc_analyze at 0x7f74a611b370>
    │    └ []
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f7307c00220>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 69, in doc_analyze
    custom_model = CustomPEKModel(ocr=ocr, show_log=show_log, models_dir=local_models_dir, device=device)
                   │                  │             │                    │                        └ 'cuda'
                   │                  │             │                    └ '/home/yuanye/PDF-Extract-Kit/models'
                   │                  │             └ False
                   │                  └ False
                   └ <class 'magic_pdf.model.pdf_extract_kit.CustomPEKModel'>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 115, in __init__
    self.layout_model = Layoutlmv3_Predictor(
    │                   └ <class 'magic_pdf.model.pek_sub_modules.layoutlmv3.model_init.Layoutlmv3_Predictor'>
    └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f7307c346a0>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py", line 122, in __init__
    cfg = setup(layout_args, device)
          │     │            └ 'cuda'
          │     └ {'config_file': '/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layout...
          └ <function setup at 0x7f7308d7f520>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py", line 82, in setup
    register_coco_instances(
    └ <function register_coco_instances at 0x7f730a999870>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/detectron2/data/datasets/coco.py", line 510, in register_coco_instances
    DatasetCatalog.register(name, lambda: load_coco_json(json_file, image_root, name))
    │              │        │             │              │          │           └ 'scihub_train'
    │              │        │             │              │          └ '/mnt/petrelfs/share_data/zhaozhiyuan/publaynet/layout_scihub/train'
    │              │        │             │              └ '/mnt/petrelfs/share_data/zhaozhiyuan/publaynet/layout_scihub/train.json'
    │              │        │             └ <function load_coco_json at 0x7f730a998f70>
    │              │        └ 'scihub_train'
    │              └ <function _DatasetCatalog.register at 0x7f730a97c430>
    └ DatasetCatalog(registered datasets: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_valminusminival, coco_2017_t...
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/detectron2/data/catalog.py", line 37, in register
    assert name not in self, "Dataset '{}' is already registered!".format(name)
           │           │                                                  └ 'scihub_train'
           │           └ DatasetCatalog(registered datasets: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_valminusminival, coco_2017_t...
           └ 'scihub_train'

AssertionError: Dataset 'scihub_train' is already registered!
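The assert fires because detectron2's `DatasetCatalog` is a process-global registry and `register` is not idempotent: MinerU's `setup()` registers `scihub_train` every time a `CustomPEKModel` is constructed, so the second `pipe_analyze()` hits an already-populated catalog. A minimal, detectron2-free model of that behavior (the class below is an illustration, not detectron2's actual code):

```python
class FakeDatasetCatalog:
    """Tiny stand-in for detectron2's DatasetCatalog, for illustration only."""

    def __init__(self):
        self._registry = {}

    def register(self, name, func):
        # detectron2 asserts here instead of silently overwriting,
        # which is exactly the error seen in the traceback above.
        assert name not in self._registry, f"Dataset '{name}' is already registered!"
        self._registry[name] = func


catalog = FakeDatasetCatalog()
catalog.register("scihub_train", lambda: [])   # first call: fine
try:
    catalog.register("scihub_train", lambda: [])  # second call: same name again
except AssertionError as e:
    print(e)  # Dataset 'scihub_train' is already registered!
```

Any fix therefore has to either register the dataset only once per process or de-register it before constructing the model again.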

How to reproduce the bug | 如何复现

Call the `process_one_file` function shown in the bug description above twice; the error appears on the second call.
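One interim workaround (not taken from this issue) is to de-register the stale dataset names before each call. The helper below assumes detectron2's public `DatasetCatalog.list()` / `remove()` API; the catalogs are passed as parameters so the function stays testable without detectron2 installed, and the `names` tuple is a guess covering only the dataset seen in the traceback:

```python
def clear_stale_datasets(dataset_catalog, metadata_catalog, names=("scihub_train",)):
    """Remove previously registered dataset entries so that MinerU's model
    setup can register them again on the next call.

    Pass detectron2's DatasetCatalog and MetadataCatalog singletons
    (both importable from detectron2.data).
    """
    for name in names:
        if name in dataset_catalog.list():
            dataset_catalog.remove(name)
        if name in metadata_catalog.list():
            metadata_catalog.remove(name)
```

Calling `clear_stale_datasets(DatasetCatalog, MetadataCatalog)` before each `process_one_file` invocation would sidestep the assert; installing the patched source, as the maintainer suggests below, is the cleaner fix.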

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cuda

myhloli commented 1 month ago

Clone the latest master branch; this problem was fixed in https://github.com/opendatalab/MinerU/commit/724001df541f51bbef7800835121e2092a252d5e. Until the new release is packaged, you can resolve it by installing from source:

pip uninstall magic-pdf
python setup.py install

myhloli commented 1 month ago

We have published the 0.6.2b1 release, which addresses and resolves the issue described above.