opendatalab / MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
18.23k stars 1.31k forks

New version 0.93 throws an error, apparently while loading the formula-recognition model #1023

Closed 3300752199 closed 4 days ago

3300752199 commented 4 days ago

Description of the bug

INFO:datasets:PyTorch version 2.3.1 available.
2024-11-19 10:04:34.628 | INFO | magic_pdf.model.pdf_extract_kit:__init__:68 - DocAnalysis init, this may take some times, layout_model: layoutlmv3, apply_formula: True, apply_ocr: True, apply_table: True, table_model: rapid_table, lang: None
2024-11-19 10:04:34.629 | INFO | magic_pdf.model.pdf_extract_kit:__init__:77 - using device: cuda
2024-11-19 10:04:34.629 | INFO | magic_pdf.model.pdf_extract_kit:__init__:79 - using models_dir: /home/data/PDF-Extract-Kit-1.0/models/
2024-11-19 10:04:35.151 | ERROR | __main__:<module>:24 - Missing key length_aware
    full_key: model.model_config.length_aware
    object_type=dict
Traceback (most recent call last):

  File "/home/jupyter-yhy/MinerU/Mineru_test_demo.py", line 18, in <module>
    pipe.pipe_analyze()
    │ └ <function OCRPipe.pipe_analyze at 0x7f93e013d240>
    └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f94949b7f70>

  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/pipe/OCRPipe.py", line 22, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
    │ │ │ │ └ b'%PDF-1.7\r\n%\xa1\xb3\xc5\xd7\r\n6 0 obj\r\n<</Filter/FlateDecode/Length 70>>stream\r\nx\x9c+\xe45T0\x00B]\x10ejb\xa0\x90\x...
    │ │ │ └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f94949b7f70>
    │ │ └ <function doc_analyze at 0x7f93e17f2830>
    │ └ {'_pdf_type': '', 'model_list': []}
    └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f94949b7f70>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 147, in doc_analyze
    custom_model = model_manager.get_model(ocr, show_log, lang, layout_model, formula_enable, table_enable)
    │ │ │ │ │ │ │ └ None
    │ │ │ │ │ │ └ None
    │ │ │ │ │ └ None
    │ │ │ │ └ None
    │ │ │ └ False
    │ │ └ True
    │ └ <function ModelSingleton.get_model at 0x7f93e17f27a0>
    └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x7f949469e3b0>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 75, in get_model
    self._models[key] = custom_model_init(ocr=ocr, show_log=show_log, lang=lang, layout_model=layout_model,
    │ │ │ │ │ │ │ └ None
    │ │ │ │ │ │ └ None
    │ │ │ │ │ └ False
    │ │ │ │ └ True
    │ │ │ └ <function custom_model_init at 0x7f93e17f2680>
    │ │ └ (True, False, None, None, None, None)
    │ └ {}
    └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x7f949469e3b0>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 126, in custom_model_init
    custom_model = CustomPEKModel(**model_input)
    │ └ {'ocr': True, 'show_log': False, 'models_dir': '/home/data/PDF-Extract-Kit-1.0/models/', 'device': 'cuda', 'table_config': {'...
    └ <class 'magic_pdf.model.pdf_extract_kit.CustomPEKModel'>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 95, in __init__
    self.mfr_model = atom_model_manager.get_atom_model(
    │ │ └ <function AtomModelSingleton.get_atom_model at 0x7f939120c670>
    │ └ <magic_pdf.model.sub_modules.model_init.AtomModelSingleton object at 0x7f93e02bafe0>
    └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f949469e3e0>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/sub_modules/model_init.py", line 94, in get_atom_model
    self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
    │ │ │ │ │ └ {'mfr_weight_dir': '/home/data/PDF-Extract-Kit-1.0/models/MFR/unimernet_small', 'mfr_cfg_path': '/home/jupyter-yhy/anaconda3/...
    │ │ │ │ └ 'mfr'
    │ │ │ └ <function atom_model_init at 0x7f939120c550>
    │ │ └ ('mfr', None, None)
    │ └ {('mfd', None, None): <magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8.YOLOv8MFDModel object at 0x7f93c37f2260>}
    └ <magic_pdf.model.sub_modules.model_init.AtomModelSingleton object at 0x7f93e02bafe0>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/sub_modules/model_init.py", line 118, in atom_model_init
    atom_model = mfr_model_init(
    └ <function mfr_model_init at 0x7f939120c310>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/sub_modules/model_init.py", line 41, in mfr_model_init
    mfr_model = UnimernetModel(weight_dir, cfg_path, device)
    │ │ │ └ 'cuda'
    │ │ └ '/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/resources/model_config/UniMERNet/demo.yaml'
    │ └ '/home/data/PDF-Extract-Kit-1.0/models/MFR/unimernet_small'
    └ <class 'magic_pdf.model.sub_modules.mfr.unimernet.Unimernet.UnimernetModel'>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py", line 61, in __init__
    self.model = task.build_model(cfg)
    │ │ │ └ <unimernet.common.config.Config object at 0x7f938f39bd30>
    │ │ └ <function BaseTask.build_model at 0x7f93ba24aa70>
    │ └ <unimernet.tasks.unimernet_train.UniMERNet_Train object at 0x7f938f39bdf0>
    └ <magic_pdf.model.sub_modules.mfr.unimernet.Unimernet.UnimernetModel object at 0x7f938f39bf70>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/unimernet/tasks/base_task.py", line 33, in build_model
    return model_cls.from_config(model_config)
    │ │ └ {'arch': 'unimernet', 'load_finetuned': False, 'load_pretrained': True, 'pretrained': '/home/data/PDF-Extract-Kit-1.0/models/...
    │ └ <classmethod(<function UniMERModel.from_config at 0x7f93ba3b40d0>)>
    └ <class 'unimernet.models.unimernet.unimernet.UniMERModel'>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/unimernet/models/unimernet/unimernet.py", line 102, in from_config
    model = cls(
    └ <class 'unimernet.models.unimernet.unimernet.UniMERModel'>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/unimernet/models/unimernet/unimernet.py", line 41, in __init__
    length_aware=model_config.length_aware,
    └ {'max_seq_len': 1536, 'model_name': '/home/data/PDF-Extract-Kit-1.0/models/MFR/unimernet_small'}
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
    │ └ <function Node._format_and_raise at 0x7f93c3239a20>
    └ {'max_seq_len': 1536, 'model_name': '/home/data/PDF-Extract-Kit-1.0/models/MFR/unimernet_small'}
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
    └ <function format_and_raise at 0x7f93c3238790>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
    │ │ └ ConfigKeyError('Missing key length_aware')
    │ └ ConfigAttributeError('Missing key length_aware\n full_key: model.model_config.length_aware\n object_type=dict')
    └ <function _raise at 0x7f93c3238700>
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
    │ │ │ └
    │ │ └ <module 'sys' (built-in)>
    │ └ <method 'with_traceback' of 'BaseException' objects>
    └ ConfigAttributeError('Missing key length_aware\n full_key: model.model_config.length_aware\n object_type=dict')
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
    │ └ <function DictConfig._get_impl at 0x7f93c324bc70>
    └ {'max_seq_len': 1536, 'model_name': '/home/data/PDF-Extract-Kit-1.0/models/MFR/unimernet_small'}
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
    │ └ <function BaseContainer._get_child at 0x7f93c3248310>
    └ {'max_seq_len': 1536, 'model_name': '/home/data/PDF-Extract-Kit-1.0/models/MFR/unimernet_small'}
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
    │ └ <function DictConfig._get_node at 0x7f93c324bd90>
    └ {'max_seq_len': 1536, 'model_name': '/home/data/PDF-Extract-Kit-1.0/models/MFR/unimernet_small'}
  File "/home/jupyter-yhy/anaconda3/envs/ocr_test/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
    └ <class 'omegaconf.errors.ConfigKeyError'>

omegaconf.errors.ConfigAttributeError: Missing key length_aware
    full_key: model.model_config.length_aware
    object_type=dict

How to reproduce the bug

It reports a missing parameter. What kind of error is this?

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

myhloli commented 4 days ago

Update unimernet to 0.2.1.
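To see whether an environment already meets that version, a small check along these lines may help. It is only a sketch: it assumes the package is installed under the distribution name `unimernet`, that its version is a plain dotted number, and the helper names (`parse_version`, `needs_upgrade`, `check_unimernet`) are made up for illustration. The suggested `pip` command is likewise an assumption about how the package was installed.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '0.2.1' into (0, 2, 1).

    Assumption: only numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.split("."))


def needs_upgrade(installed: str, minimum: str = "0.2.1") -> bool:
    """Return True when the installed version is older than the minimum."""
    return parse_version(installed) < parse_version(minimum)


def check_unimernet(minimum: str = "0.2.1") -> str:
    """Describe the state of the locally installed unimernet package."""
    try:
        # Assumption: the distribution is registered under the name 'unimernet'.
        installed = version("unimernet")
    except PackageNotFoundError:
        return "unimernet is not installed"
    if needs_upgrade(installed, minimum):
        return (
            f"unimernet {installed} is older than {minimum}; "
            f"try: pip install -U unimernet=={minimum}"
        )
    return f"unimernet {installed} satisfies >= {minimum}"


if __name__ == "__main__":
    print(check_unimernet())
```

Run it inside the same environment that runs magic-pdf (here, the `ocr_test` conda environment from the traceback), since that is where the outdated package would be found.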