Closed: shmiluyu closed this issue 3 months ago
Can you provide more stack trace information from the error?
2024-08-04 14:10:08.146 | WARNING | magic_pdf.cli.magicpdf:get_model_json:312 - not found json D:/work/github/magic-pdf/6105137170.json existed
2024-08-04 14:10:08.146 | WARNING | magic_pdf.libs.config_reader:get_local_dir:64 - 'temp-output-dir' not found in magic-pdf.json, use '/tmp' as default
2024-08-04 14:10:09.351 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 7728, cid_chars_radio: 0.0
2024-08-04 14:10:12.121 | ERROR | magic_pdf.model.pdf_extract_kit:
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "D:\dev-stuff\scoop\apps\anaco...
│ └ <code object
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
│ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "D:\dev-stuff\scoop\apps\anaco...
└ <code object
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\Scripts\magic-pdf.exe__main__.py", line 7, in
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\click\core.py", line 1157, in call
return self.main(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <function BaseCommand.main at 0x00000259B6D2B640>
└
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
│ │ └ <click.core.Context object at 0x00000259B6928FA0>
│ └ <function MultiCommand.invoke at 0x00000259B6D3C670>
└
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\click\core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
│ │ │ │ └ <click.core.Context object at 0x000002598C6CF2E0>
│ │ │ └ <function Command.invoke at 0x00000259B6D3C160>
│ │ └
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
│ │ │ │ │ └ {'pdf': 'D:/work/github/magic-pdf/6105137170.pdf', 'inside_model': True, 'model': None, 'method': 'auto', 'model_mode': 'full'}
│ │ │ │ └ <click.core.Context object at 0x000002598C6CF2E0>
│ │ │ └ <function pdf_command at 0x000002598C6F23B0>
│ │ └
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\click\core.py", line 783, in invoke return __callback(*args, **kwargs) │ └ {'pdf': 'D:/work/github/magic-pdf/6105137170.pdf', 'inside_model': True, 'model': None, 'method': 'auto', 'model_mode': 'full'} └ ()
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 352, in pdf_command
parse_doc(pdf)
│ └ 'D:/work/github/magic-pdf/6105137170.pdf'
└ <function pdf_command.
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 330, in parse_doc do_parse( └ <function do_parse at 0x000002598C6F1CF0>
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 111, in do_parse pipe.pipe_analyze() │ └ <function UNIPipe.pipe_analyze at 0x000002598C6F0DC0> └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000002598C6CEE60>
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 29, in pipe_analyze self.model_list = doc_analyze(self.pdf_bytes, ocr=False) │ │ │ │ └ b'%PDF-1.3\n%\xe2\xe3\xcf\xd3\n11 0 obj\r<<\r/Length 12 0 R\r/Filter [ /FlateDecode ]\r>>\rstream\nx\x9c\xbdZ\xcd\xabeG\x11... │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000002598C6CEE60> │ │ └ <function doc_analyze at 0x00000259B9480D30> │ └ [] └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000002598C6CEE60>
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 103, in doc_analyze custom_model = model_manager.get_model(ocr, show_log) │ │ │ └ False │ │ └ False │ └ <function ModelSingleton.get_model at 0x00000259B9480CA0> └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x000002598C6CF7F0>
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 63, in get_model self._models[key] = custom_model_init(ocr=ocr, show_log=show_log) │ │ │ │ │ └ False │ │ │ │ └ False │ │ │ └ <function custom_model_init at 0x00000259B9480B80> │ │ └ (False, False) │ └ {} └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x000002598C6CF7F0>
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 83, in custom_model_init from magic_pdf.model.pdf_extract_kit import CustomPEKModel
File "
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\magic_pdf\model\pdf_extract_kit.py", line 18, in
from ultralytics import YOLO
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\ultralytics__init__.py", line 10, in
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\ultralytics\data__init__.py", line 3, in
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\ultralytics\data\base.py", line 17, in
File "D:\dev-stuff\scoop\apps\anaconda3\current\App\envs\MinerU\lib\site-packages\ultralytics\data\utils.py", line 19, in
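It looks like matplotlib was not installed correctly. Try uninstalling and reinstalling it.
Same problem here. Uninstalling and reinstalling did not help either.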
magic-pdf pdf-command --pdf "E:\PDF-Extract-Kit\PDF-Extract-Kit\demo\模拟试卷.pdf" --inside_model true
2024-08-04 17:08:38.161 | WARNING | magic_pdf.cli.magicpdf:get_model_json:312 - not found json E:\PDF-Extract-Kit\PDF-Extract-Kit\demo\模拟试卷.json existed
2024-08-04 17:08:38.161 | WARNING | magic_pdf.libs.config_reader:get_local_dir:64 - 'temp-output-dir' not found in magic-pdf.json, use '/tmp' as default
2024-08-04 17:08:38.514 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 2119, cid_chars_radio: 0.0
2024-08-04 17:08:40.725 | ERROR | magic_pdf.model.pdf_extract_kit:
File "E:\tensflow\MinerU\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "E:\tensflow\MinerU\Scripts\ma...
│ └ <code object
File "E:\tensflow\MinerU\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
│ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "E:\tensflow\MinerU\Scripts\ma...
└ <code object
File "E:\tensflow\MinerU\Scripts\magic-pdf.exe__main__.py", line 7, in
File "E:\tensflow\MinerU\lib\site-packages\click\core.py", line 1157, in call
return self.main(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <function BaseCommand.main at 0x00000257BDF5D750>
└
File "E:\tensflow\MinerU\lib\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
│ │ └ <click.core.Context object at 0x00000257BDB59000>
│ └ <function MultiCommand.invoke at 0x00000257BDF5E710>
└
File "E:\tensflow\MinerU\lib\site-packages\click\core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
│ │ │ │ └ <click.core.Context object at 0x00000257FF8733D0>
│ │ │ └ <function Command.invoke at 0x00000257BDF5E200>
│ │ └
File "E:\tensflow\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
│ │ │ │ │ └ {'pdf': 'E:\PDF-Extract-Kit\PDF-Extract-Kit\demo\模拟试卷.pdf', 'inside_model': True, 'model': None, 'method': 'auto', 'model...
│ │ │ │ └ <click.core.Context object at 0x00000257FF8733D0>
│ │ │ └ <function pdf_command at 0x00000257FF898160>
│ │ └
File "E:\tensflow\MinerU\lib\site-packages\click\core.py", line 783, in invoke return __callback(*args, **kwargs) │ └ {'pdf': 'E:\PDF-Extract-Kit\PDF-Extract-Kit\demo\模拟试卷.pdf', 'inside_model': True, 'model': None, 'method': 'auto', 'model... └ ()
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 352, in pdf_command
parse_doc(pdf)
│ └ 'E:\PDF-Extract-Kit\PDF-Extract-Kit\demo\模拟试卷.pdf'
└ <function pdf_command.
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 330, in parse_doc do_parse( └ <function do_parse at 0x00000257FF88BA30>
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 111, in do_parse pipe.pipe_analyze() │ └ <function UNIPipe.pipe_analyze at 0x00000257FF88A7A0> └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x00000257FF872F50>
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 29, in pipe_analyze self.model_list = doc_analyze(self.pdf_bytes, ocr=False) │ │ │ │ └ b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(zh-CN) /StructTreeRoot 35 0 R/MarkInfo<</Marke... │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x00000257FF872F50> │ │ └ <function doc_analyze at 0x00000257C06B6EF0> │ └ [] └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x00000257FF872F50>
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 103, in doc_analyze custom_model = model_manager.get_model(ocr, show_log) │ │ │ └ False │ │ └ False │ └ <function ModelSingleton.get_model at 0x00000257C06B6E60> └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x00000257FF983A60>
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 63, in get_model self._models[key] = custom_model_init(ocr=ocr, show_log=show_log) │ │ │ │ │ └ False │ │ │ │ └ False │ │ │ └ <function custom_model_init at 0x00000257C06B6D40> │ │ └ (False, False) │ └ {} └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x00000257FF983A60>
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 83, in custom_model_init from magic_pdf.model.pdf_extract_kit import CustomPEKModel
File "
File "E:\tensflow\MinerU\lib\site-packages\magic_pdf\model\pdf_extract_kit.py", line 18, in
from ultralytics import YOLO
File "E:\tensflow\MinerU\lib\site-packages\ultralytics__init__.py", line 10, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\data__init__.py", line 3, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\data\base.py", line 17, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\data\utils.py", line 19, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\nn__init__.py", line 3, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\nn\tasks.py", line 10, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\nn\modules__init__.py", line 20, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\nn\modules\block.py", line 8, in
File "E:\tensflow\MinerU\lib\site-packages\ultralytics\utils__init__.py", line 21, in
File "E:\tensflow\MinerU\lib\site-packages\matplotlib__init.py", line 159, in
File "E:\tensflow\MinerU\lib\site-packages\matplotlib\cbook.py", line 32, in
ImportError: DLL load failed while importing _c_internal_utils: 找不到指定的模块。
2024-08-04 17:08:40.735 | ERROR | magic_pdf.model.pdf_extract_kit:
This may be because _c_internal_utils was deprecated in newer matplotlib versions (or for some other reason). Downgrading matplotlib fixes it, for example matplotlib=3.7.5
笑川大佐 is awesome!
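What is a little odd is that matplotlib 3.9.1 was already released in July, and over the past week we ran installation compatibility tests on fresh environments without hitting this problem 😂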
I downgraded to matplotlib-3.8.4 and the error disappeared. Thank you very much.
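Since `import matplotlib` itself fails in the affected environments, it can help to read the installed version from the package metadata instead of importing the library. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of magic-pdf.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}.

    Reads package metadata only, so it works even when importing the
    package itself fails (as matplotlib does in this thread).
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names taken from the pip list in this issue.
    for name, version in installed_versions(
        ["matplotlib", "ultralytics", "magic-pdf"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```

Run inside the same environment that `magic-pdf` uses, this shows whether the downgrade actually took effect there.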
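I checked my local development environment: matplotlib is at version 3.9.1, which matches the latest official release.
I looked at https://github.com/matplotlib/matplotlib and _c_internal_utils is present in the latest code, so this import error should not occur.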
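This looks similar to an import failure we saw earlier with another library. The error message
ImportError: DLL load failed while importing _c_internal_utils: 找不到指定的模块。
(the Windows message means "The specified module could not be found") may not mean that the _c_internal_utils module itself is missing. Instead, an error may occur inside _c_internal_utils when it tries to load some DLL from the local machine. That most likely happens when a DLL that matplotlib needs was not unpacked to the correct path during installation. Uninstalling and reinstalling the affected library usually fixes it.
There are many reasons a DLL might not end up in the correct path. One of them is antivirus software misidentifying it as a trojan or virus and silently deleting it.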
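The distinction drawn above, between a module that is absent and a module that is present but fails while loading, can be checked roughly as follows. This is a sketch under assumptions: `diagnose_extension` is a hypothetical helper, and the three labels it returns are illustrative rather than anything matplotlib or magic-pdf defines.

```python
import importlib
import importlib.util


def diagnose_extension(module_name: str) -> str:
    """Roughly classify why a module cannot be imported.

    Returns "ok", "missing", or "load_failed".
    """
    try:
        # For a dotted name, find_spec imports the parent package first.
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # The parent package (or something it imports) was not found.
        return "missing"
    except ImportError:
        # The parent package exists but failed while importing, e.g. a
        # "DLL load failed" error raised from one of its extensions.
        return "load_failed"
    if spec is None:
        return "missing"
    try:
        importlib.import_module(module_name)
    except ImportError:
        # The module was located, but importing it failed.
        return "load_failed"
    return "ok"


if __name__ == "__main__":
    print(diagnose_extension("matplotlib._c_internal_utils"))
```

A "load_failed" result is consistent with the hypothesis in this comment (a dependent DLL is absent or was removed), but it does not identify which DLL is missing.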
Same problem here. Downgrading matplotlib did fix it.
Indeed, this extension does exist. The official explanation is that it either needs to be rebuilt or the MSVC Redistributable is missing.
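Indeed. I also checked the matplotlib release history on PyPI: prebuilt Windows packages were provided for 3.9.0 and every earlier version, whereas 3.9.1 only provides prebuilt packages for Linux and macOS. On Windows machines without an MSVC build environment, installing 3.9.1 therefore falls back to building from source, and the installation probably does not surface a clear build-failure message. As a result, the install appears to succeed on those machines, but the DLL that needs to be loaded is never built, which causes the matplotlib import failure on those Windows devices. Going forward we will pin matplotlib to 3.9.0 or earlier to prevent failed installs on these Windows devices.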
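Whether an installed version falls inside such a pin can be checked with a small sketch like the one below. The helper names `release_tuple` and `within_pin` are hypothetical. The default bound `3.9.0` comes from this comment, and treating 3.9.0 itself as allowed is an assumption. The comparison only looks at the leading numeric release segment, so real tooling would more likely rely on the `packaging` library.

```python
def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release segment, e.g. "0.6.2b1" -> (0, 6, 2)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def within_pin(installed: str, upper: str = "3.9.0") -> bool:
    """True if `installed` is at or below the assumed inclusive upper bound."""
    return release_tuple(installed) <= release_tuple(upper)


if __name__ == "__main__":
    # Versions mentioned in this thread.
    for candidate in ["3.7.5", "3.8.4", "3.9.0", "3.9.1"]:
        print(candidate, within_pin(candidate))
```

Of the versions mentioned in this thread, only 3.9.1 falls outside the bound, which matches the reports that downgrading to 3.7.5 or 3.8.4 resolved the error.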
Description of the bug | 错误描述
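I followed the installation steps strictly and installed the full environment. Running pip install magic-pdf[full]==0.6.2b1 again reports that all dependencies are already satisfied. However, running the demo conversion command fails with: 2024-08-04 11:49:53.272 | ERROR | magic_pdf.model.pdf_extract_kit::24 - DLL load failed while importing _c_internal_utils: 找不到指定的模块。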
pip list
Package Version
absl-py 2.1.0
aiohappyeyeballs 2.3.4
aiohttp 3.10.0
aiosignal 1.3.1
albucore 0.0.13
albumentations 1.4.12
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.4.0
astor 0.8.1
async-timeout 4.0.3
attrdict 2.0.1
attrs 24.1.0
Babel 2.15.0
bce-python-sdk 0.9.19
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
boto3 1.34.153
botocore 1.34.153
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.4.0
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorama 0.4.6
colorlog 6.8.2
contourpy 1.2.1
cryptography 43.0.0
cssselect 1.2.0
cssutils 2.11.1
cycler 0.12.1
Cython 3.0.10
datasets 2.20.0
decorator 5.1.1
detectron2 0.6
dill 0.3.8
et-xmlfile 1.1.0
eva-decord 0.6.1
eval_type_backport 0.2.0
evaluate 0.4.2
exceptiongroup 1.2.2
fairscale 0.4.13
fast-langdetect 0.2.0
fasttext-wheel 0.9.2
filelock 3.15.4
fire 0.6.0
Flask 3.0.3
flask-babel 4.0.0
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.5.0
ftfy 6.2.0
future 1.0.0
fvcore 0.1.5.post20221221
grpcio 1.65.4
h11 0.14.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.24.5
hydra-core 1.3.2
idna 3.7
imageio 2.34.2
imgaug 0.4.0
intel-openmp 2021.4.0
iopath 0.1.9
itsdangerous 2.2.0
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
kiwisolver 1.4.5
langdetect 1.0.9
lazy_loader 0.4
lmdb 1.5.1
loguru 0.7.2
lxml 5.2.2
magic-pdf 0.6.2b1
Markdown 3.6
MarkupSafe 2.1.5
matplotlib 3.9.1
mkl 2021.4.0
more-itertools 10.3.0
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mypy-extensions 1.0.0
networkx 3.3
numpy 1.26.4
omegaconf 2.3.0
opencv-contrib-python 4.6.0.66
opencv-python 4.6.0.66
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opt-einsum 3.3.0
packaging 24.1
paddleocr 2.7.3
paddlepaddle 2.6.1
pandas 2.2.2
pathspec 0.12.1
pdf2docx 0.5.8
pdfminer.six 20231228
pillow 10.4.0
pip 24.0
platformdirs 4.2.2
portalocker 2.10.1
premailer 3.10.0
protobuf 3.20.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pybind11 2.13.1
pyclipper 1.3.0.post5
pycocotools 2.0.8
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
PyMuPDF 1.24.9
PyMuPDFb 1.24.9
pyparsing 3.1.2
python-dateutil 2.9.0.post0
python-docx 1.1.2
pytz 2024.1
pywin32 306
PyYAML 6.0.1
rapidfuzz 3.9.5
rarfile 4.2
regex 2024.7.24
requests 2.32.3
robust-downloader 0.0.2
s3transfer 0.10.2
safetensors 0.4.3
scikit-image 0.24.0
scikit-learn 1.5.1
scipy 1.14.0
seaborn 0.13.2
setuptools 69.5.1
shapely 2.0.5
six 1.16.0
sniffio 1.3.1
soupsieve 2.5
sympy 1.13.1
tabulate 0.9.0
tbb 2021.13.0
tensorboard 2.17.0
tensorboard-data-server 0.7.2
termcolor 2.4.0
threadpoolctl 3.5.0
tifffile 2024.7.24
timm 0.9.16
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.1
torchtext 0.18.0
torchvision 0.18.1
tqdm 4.66.5
transformers 4.40.0
typing_extensions 4.12.2
tzdata 2024.1
ultralytics 8.2.72
ultralytics-thop 2.0.0
unimernet 0.1.6
urllib3 2.2.2
visualdl 2.5.3
Wand 0.6.13
wcwidth 0.2.13
webdataset 0.2.86
Werkzeug 3.0.3
wheel 0.43.0
win32-setctime 1.1.0
wordninja 2.0.0
xxhash 3.4.1
yacs 0.1.8
yarl 1.9.4
How to reproduce the bug | 如何复现
magic-pdf pdf-command --pdf "D:/magic-pdf/6105137170.pdf" --inside_model true
Operating system | 操作系统
Windows
Python version | Python 版本
3.10
Software version | 软件版本 (magic-pdf --version)
0.6.x
Device mode | 设备模式
cpu