opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://mineru.readthedocs.io/
GNU Affero General Public License v3.0
16.58k stars, 1.2k forks

magic_pdf.tools.cli:parse_doc:109 #851

Closed. coolboy5298 closed this issue 1 week ago

coolboy5298 commented 1 week ago

Description of the bug

magic_pdf.tools.cli:parse_doc:109

How to reproduce the bug

(MinerU) E:>magic-pdf -p 使用搭建天气小助手智能体.pdf -o /temp
2024-11-04 08:39:53.177 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 1729, cid_chars_radio: 0.0
2024-11-04 08:40:06.935 | ERROR | magic_pdf.tools.cli:parse_doc:109 - Expecting ',' delimiter: line 18 column 34 (char 513)
Traceback (most recent call last):

File "C:\Users\ccier.conda\envs\MinerU\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    │ │ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "C:\Users\ccier.conda\envs\Mi...
    │ └ <code object at 0x00000275F3EE7E10, file "C:\Users\ccier.conda\envs\MinerU\Scripts\magic-pdf.exe__main__.py", line 1>
    └ <function _run_code at 0x00000275F3ED12D0>

File "C:\Users\ccier.conda\envs\MinerU\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
    │ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "C:\Users\ccier.conda\envs\Mi...
    └ <code object at 0x00000275F3EE7E10, file "C:\Users\ccier.conda\envs\MinerU\Scripts\magic-pdf.exe__main__.py", line 1>

File "C:\Users\ccier.conda\envs\MinerU\Scripts\magic-pdf.exe__main__.py", line 7, in
    sys.exit(cli())
    │ │ └
    │ └
    └ <module 'sys' (built-in)>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\click\core.py", line 1157, in call
    return self.main(*args, **kwargs)
    │ │ │ └ {}
    │ │ └ ()
    │ └ <function BaseCommand.main at 0x00000275F4365120>
    └

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
    │ │ └ <click.core.Context object at 0x00000275F3F38CD0>
    │ └ <function Command.invoke at 0x00000275F4365BD0>
    └

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    │ │ │ │ │ └ {'path': '使用搭建天气小助手智能体.pdf', 'output_dir': '/temp', 'method': 'auto', 'lang': None, 'debug_able': False, 'start_page_i...
    │ │ │ │ └ <click.core.Context object at 0x00000275F3F38CD0>
    │ │ │ └ <function cli at 0x00000275A5B18040>
    │ │ └
    │ └ <function Context.invoke at 0x00000275F4364940>
    └ <click.core.Context object at 0x00000275F3F38CD0>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    │ └ {'path': '使用搭建天气小助手智能体.pdf', 'output_dir': '/temp', 'method': 'auto', 'lang': None, 'debug_able': False, 'start_page_i...
    └ ()

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\tools\cli.py", line 115, in cli
    parse_doc(path)
    │ └ '使用+MaxKB+搭建天气小助手智能体.pdf'
    └ <function cli..parse_doc at 0x00000275F3F1F640>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\tools\cli.py", line 96, in parse_doc
    do_parse(
    └ <function do_parse at 0x00000275A5AFF370>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\tools\common.py", line 87, in do_parse
    pipe.pipe_analyze()
    │ └ <function UNIPipe.pipe_analyze at 0x00000275A5AFF880>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x00000275A5B0C7C0>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 32, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=False,
    │ │ │ │ └ b'%PDF-1.5\n%\xe2\xe3\xcf\xd3\n3 0 obj\n<</ColorSpace/DeviceRGB/Subtype/Image/Height 1408/Filter/FlateDecode/Type/XObject/Wid...
    │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x00000275A5B0C7C0>
    │ │ └ <function doc_analyze at 0x00000275A5771AB0>
    │ └ []
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x00000275A5B0C7C0>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 147, in doc_analyze
    custom_model = model_manager.get_model(ocr, show_log, lang, layout_model, formula_enable, table_enable)
    │ │ │ │ │ │ │ └ None
    │ │ │ │ │ │ └ None
    │ │ │ │ │ └ None
    │ │ │ │ └ None
    │ │ │ └ False
    │ │ └ False
    │ └ <function ModelSingleton.get_model at 0x00000275A5771A20>
    └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x00000275A5B0D780>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 75, in get_model
    self._models[key] = custom_model_init(ocr=ocr, show_log=show_log, lang=lang, layout_model=layout_model,
    │ │ │ │ │ │ │ └ None
    │ │ │ │ │ │ └ None
    │ │ │ │ │ └ False
    │ │ │ │ └ False
    │ │ │ └ <function custom_model_init at 0x00000275A5771900>
    │ │ └ (False, False, None, None, None, None)
    │ └ {}
    └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x00000275A5B0D780>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 100, in custom_model_init
    local_models_dir = get_local_models_dir()
    └ <function get_local_models_dir at 0x00000275A57713F0>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\libs\config_reader.py", line 59, in get_local_models_dir
    config = read_config()
    └ <function read_config at 0x00000275A5771120>

File "C:\Users\ccier.conda\envs\MinerU\lib\site-packages\magic_pdf\libs\config_reader.py", line 26, in read_config
    config = json.load(f)
    │ │ └ <_io.TextIOWrapper name='C:\\Users\\ccier\\magic-pdf.json' mode='r' encoding='utf-8'>
    │ └ <function load at 0x00000275F5EAA320>
    └ <module 'json' from 'C:\Users\ccier\.conda\envs\MinerU\lib\json\init.py'>

File "C:\Users\ccier.conda\envs\MinerU\lib\json__init__.py", line 293, in load
    return loads(fp.read(),
    │ │ └ <method 'read' of '_io.TextIOWrapper' objects>
    │ └ <_io.TextIOWrapper name='C:\\Users\\ccier\\magic-pdf.json' mode='r' encoding='utf-8'>
    └ <function loads at 0x00000275F5EAA3B0>

File "C:\Users\ccier.conda\envs\MinerU\lib\json__init__.py", line 346, in loads
    return _default_decoder.decode(s)
    │ │ └ '{\n "bucket_info": {\n "bucket-name-1": [\n "ak",\n "sk",\n "endpoint"\n ]...
    │ └ <function JSONDecoder.decode at 0x00000275F5EA9C60>
    └ <json.decoder.JSONDecoder object at 0x00000275F5EB8220>

File "C:\Users\ccier.conda\envs\MinerU\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    │ │ │ │ └ '{\n "bucket_info": {\n "bucket-name-1": [\n "ak",\n "sk",\n "endpoint"\n ]...
    │ │ │ └ <built-in method match of re.Pattern object at 0x00000275F5E51080>
    │ │ └ '{\n "bucket_info": {\n "bucket-name-1": [\n "ak",\n "sk",\n "endpoint"\n ]...
    │ └ <function JSONDecoder.raw_decode at 0x00000275F5EA9CF0>
    └ <json.decoder.JSONDecoder object at 0x00000275F5EB8220>

File "C:\Users\ccier.conda\envs\MinerU\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
    │ │ │ └ 0
    │ │ └ '{\n "bucket_info": {\n "bucket-name-1": [\n "ak",\n "sk",\n "endpoint"\n ]...
    │ └ <_json.Scanner object at 0x00000275F5E674C0>
    └ <json.decoder.JSONDecoder object at 0x00000275F5EB8220>

json.decoder.JSONDecodeError: Expecting ',' delimiter: line 18 column 34 (char 513)

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cpu

myhloli commented 1 week ago

The JSON config file is malformed. Check whether line 18 is missing a comma.
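To find this kind of mistake without running the whole pipeline, the config text can be passed through Python's own json module, which is what raised the error in the traceback. The following is only a rough sketch: the helper name locate_json_error and the two-key sample are made up for illustration and are not part of magic-pdf. It uses the lineno, colno, pos, and msg attributes of json.JSONDecodeError to print the lines around the reported position.

```python
import json
from typing import Optional


def locate_json_error(text: str, context: int = 1) -> Optional[str]:
    """Return a short report for the first JSON syntax error, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        lines = text.splitlines()
        # Show `context` lines before and after the line the parser complained about.
        first = max(err.lineno - 1 - context, 0)
        last = min(err.lineno + context, len(lines))
        report = [f"{err.msg}: line {err.lineno} column {err.colno} (char {err.pos})"]
        for number in range(first, last):
            marker = ">>" if number + 1 == err.lineno else "  "
            report.append(f"{marker} {number + 1:3d}: {lines[number]}")
        return "\n".join(report)
    return None


if __name__ == "__main__":
    # Hypothetical two-key sample with the comma missing after the first value;
    # it is NOT the reporter's actual magic-pdf.json.
    broken = '{\n  "models-dir": "C:/models"\n  "device-mode": "cpu"\n}'
    print(locate_json_error(broken))
```

To check a real file, read it first, for example with pathlib.Path(r"C:\Users\ccier\magic-pdf.json").read_text(encoding="utf-8") (the path shown in the traceback above), and pass the text to the helper. Note that the reported position is where the parser noticed the problem, so the actual mistake, such as a missing comma, may sit at the end of the previous line or just before the reported column.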