opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
GNU Affero General Public License v3.0

detectron2 still reports an error after installation #233

Closed Knightlj closed 1 month ago

Knightlj commented 1 month ago

Description of the bug

2024-07-29 09:40:30.261 | ERROR | magic_pdf.model.pdf_extract_kit::24 - Required dependency not installed, please install by "pip install magic-pdf[full-cpu] detectron2 --extra-index-url"

Even after detectron2 installs successfully, this error is still reported every time.

How to reproduce the bug

Operating system

Python version

Software version (magic-pdf --version)

Device mode


myhloli commented 1 month ago

Please upload the output of:

pip list
zifengdexiatian commented 1 month ago

I have the same problem, on CentOS. My pip list output is below (columns: Package, Version):

absl-py 2.1.0 aiohttp 3.9.5 aiosignal 1.3.1 albucore 0.0.12 albumentations 1.4.12 altair 5.3.0 annotated-types 0.7.0 antlr4-python3-runtime 4.9.3 anyio 4.4.0 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 astor 0.8.1 asttokens 2.4.1 async-lru 2.0.4 async-timeout 4.0.3 attrdict 2.0.1 attrs 23.2.0 Babel 2.15.0 bce-python-sdk 0.9.17 beautifulsoup4 4.12.3 black 24.4.2 bleach 6.1.0 blinker 1.8.2 boto3 1.34.149 botocore 1.34.149 braceexpand 0.1.7 Brotli 1.1.0 cachetools 5.4.0 certifi 2024.7.4 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 colorlog 6.8.2 comm 0.2.2 contourpy 1.2.1 cryptography 43.0.0 cssselect 1.2.0 cssutils 2.11.1 cycler 0.12.1 Cython 3.0.10 datasets 2.20.0 debugpy 1.8.2 decorator 5.1.1 defusedxml 0.7.1 detectron2 0.6 dill 0.3.8 et-xmlfile 1.1.0 eva-decord 0.6.1 eval_type_backport 0.2.0 evaluate 0.4.2 exceptiongroup 1.2.2 executing 2.0.1 fairscale 0.4.13 fast-langdetect 0.2.1 fastjsonschema 2.20.0 fasttext-wheel 0.9.2 filelock 3.15.4 fire 0.6.0 Flask 3.0.3 flask-babel 4.0.0 fonttools 4.53.1 fqdn 1.5.1 frozenlist 1.4.1 fsspec 2024.5.0 ftfy 6.2.0 future 1.0.0 fvcore 0.1.5.post20221221 gitdb 4.0.11 GitPython 3.1.43 grpcio 1.65.1 h11 0.14.0 httpcore 1.0.5 httpx 0.27.0 huggingface-hub 0.24.2 hydra-core 1.3.2 idna 3.7 imageio 2.34.2 imgaug 0.4.0 iopath 0.1.9 ipykernel 6.29.5 ipython 8.26.0 isoduration 20.11.0 itsdangerous 2.2.0 jedi 0.19.1 Jinja2 3.1.4 jmespath 1.0.1 joblib 1.4.2 json5 0.9.25 jsonpointer 3.0.0 jsonschema 4.23.0 jsonschema-specifications 2023.12.1 jupyter_client 8.6.2 jupyter_core 5.7.2 jupyter-events 0.10.0 jupyter-lsp 2.2.5 jupyter_server 2.14.2 jupyter_server_terminals 0.5.3 jupyterlab 4.2.4 jupyterlab_pygments 0.3.0 jupyterlab_server 2.27.3 kiwisolver 1.4.5 lazy_loader 0.4 lmdb 1.5.1 loguru 0.7.2 lxml 5.2.2 magic-pdf 0.6.1 Markdown 3.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.1 matplotlib-inline 0.1.7 mdurl 0.1.2 mistune 3.0.2 more-itertools 10.3.0 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 
mypy-extensions 1.0.0 nbclient 0.10.0 nbconvert 7.16.4 nbformat 5.10.4 nest-asyncio 1.6.0 networkx 3.3 nltk 3.8.1 notebook_shim 0.2.4 numpy 1.26.4 nvidia-cublas-cu12 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 nvidia-cufft-cu12 nvidia-curand-cu12 nvidia-cusolver-cu12 nvidia-cusparse-cu12 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.82 nvidia-nvtx-cu12 12.1.105 omegaconf 2.3.0 opencv-contrib-python opencv-python opencv-python-headless openpyxl 3.1.5 opt-einsum 3.3.0 overrides 7.7.0 packaging 24.1 paddleocr 2.7.3 paddlepaddle 2.6.1 pandas 2.2.2 pandocfilters 1.5.1 parso 0.8.4 pathspec 0.12.1 pdf2docx 0.5.8 pdf2image 1.17.0 pdfminer.six 20240706 pexpect 4.9.0 pillow 10.4.0 pip 24.0 platformdirs 4.2.2 portalocker 2.10.1 premailer 3.10.0 prometheus_client 0.20.0 prompt_toolkit 3.0.47 protobuf 4.25.4 psutil 6.0.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyarrow 17.0.0 pyarrow-hotfix 0.6 pybind11 2.13.1 pyclipper 1.3.0.post5 pycocotools 2.0.8 pycparser 2.22 pycryptodome 3.20.0 pydantic 2.8.2 pydantic_core 2.20.1 pydeck 0.9.1 Pygments 2.18.0 PyMuPDF 1.24.9 PyMuPDFb 1.24.9 pyparsing 3.1.2 pypdfium2 4.30.0 python-dateutil 2.9.0.post0 python-docx 1.1.2 python-json-logger 2.0.7 pytz 2024.1 PyYAML 6.0.1 pyzmq 26.0.3 rapidfuzz 3.9.4 rarfile 4.2 referencing 0.35.1 regex 2024.7.24 requests 2.32.3 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.1 robust-downloader 0.0.2 rpds-py 0.19.1 s3transfer 0.10.2 safetensors 0.4.3 scikit-image 0.24.0 scikit-learn 1.5.1 scipy 1.14.0 seaborn 0.13.2 Send2Trash 1.8.3 setuptools 71.0.4 shapely 2.0.5 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 soupsieve 2.5 stack-data 0.6.3 streamlit 1.37.0 streamlit-drawable-canvas 0.9.3 sympy 1.13.1 tabulate 0.9.0 tenacity 8.5.0 tensorboard 2.17.0 tensorboard-data-server 0.7.2 termcolor 2.4.0 terminado 0.18.1 threadpoolctl 3.5.0 tifffile 2024.7.24 timm 0.9.16 tinycss2 1.3.0 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 toolz 
0.12.1 torch 2.3.1 torchtext 0.18.0 torchvision 0.18.1 tornado 6.4.1 tqdm 4.66.4 traitlets 5.14.3 transformers 4.40.0 triton 2.3.1 types-python-dateutil typing_extensions 4.12.2 tzdata 2024.1 ultralytics 8.2.68 ultralytics-thop 2.0.0 unimernet 0.1.1 uri-template 1.3.0 urllib3 2.2.2 visualdl 2.5.3 Wand 0.6.13 watchdog 4.0.1 wcwidth 0.2.13 webcolors 24.6.0 webdataset 0.2.86 webencodings 0.5.1 websocket-client 1.8.0 Werkzeug 3.0.3 wheel 0.43.0 wordninja 2.0.0 xxhash 3.4.1 yacs 0.1.8 yarl 1.9.4

myhloli commented 1 month ago

Some people might be missing the libGL and libEGL libraries on their Linux systems. On Ubuntu, install them with:

sudo apt-get update
sudo apt-get install libgl1-mesa-glx libegl1-mesa-dev

You may use the following commands to install these libraries on CentOS.

sudo yum update
sudo yum install mesa-libGL mesa-libEGL-devel

If they do not work, please continue to provide feedback.
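Before reinstalling anything, it can help to check whether Python's loader can locate these libraries at all. The sketch below is only an illustration and is not part of MinerU: the helper name missing_libs is hypothetical, and it assumes that ctypes.util.find_library reflects what the dynamic linker sees on your Linux system.

```python
from ctypes.util import find_library


def missing_libs(names=("GL", "EGL"), finder=find_library):
    """Return the library names that the finder cannot locate.

    `finder` defaults to ctypes.util.find_library, which returns a file
    name such as 'libGL.so.1' when the library is found and None otherwise.
    """
    return [name for name in names if finder(name) is None]


if __name__ == "__main__":
    missing = missing_libs()
    if missing:
        print("Not found:", ", ".join("lib" + name for name in missing))
    else:
        print("libGL and libEGL were both found.")
```

If a name is reported as not found, the apt-get or yum commands above are the place to start.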

zifengdexiatian commented 1 month ago

@myhloli Thanks, I will try that. By the way, is there a Docker image available?

drunkpig commented 1 month ago

@zifengdexiatian For a Dockerfile, please refer to this link. Note that we have not tested it yet.

zifengdexiatian commented 1 month ago

@drunkpig Thanks a lot!

cyz2453057960 commented 1 month ago

Same issue on Windows.

cyz2453057960 commented 1 month ago

@Knightlj Problem solved. The underlying error was:

File "C:\ProgramData\Anaconda3\lib\site-packages\ultralytics\", line 21, in
import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib.pyplot'

I had an issue with matplotlib. You can debug the file to see which import fails.

Knightlj commented 1 month ago

The pip3 list output is shown in the screenshot below:

Knightlj commented 1 month ago

@myhloli Hi, I have posted a screenshot of the pip list output in my reply above.

cyz2453057960 commented 1 month ago

@Knightlj You can remove the logger.error call so that the exact failing import is shown. I updated matplotlib and it worked.
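The same idea can be tried without editing the installed package: import the suspected modules directly and print the real traceback for each failure. This is a minimal sketch, assuming the failing import is one of the modules named in this thread. The CANDIDATES tuple and the helper find_failing_imports are hypothetical and are not part of magic-pdf.

```python
import importlib
import traceback

# Modules mentioned in this thread; adjust the list for your own setup.
CANDIDATES = ("detectron2", "ultralytics", "matplotlib.pyplot", "paddleocr")


def find_failing_imports(names=CANDIDATES):
    """Try to import each module; return {name: traceback text} for failures."""
    failures = {}
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:  # ImportError, or OSError raised while loading a shared library
            failures[name] = traceback.format_exc()
    return failures


if __name__ == "__main__":
    for name, tb in find_failing_imports().items():
        print(f"--- {name} failed to import ---")
        print(tb)
```

A traceback printed this way names the module that is actually missing or broken, which the generic "Required dependency not installed" message hides. A crash at the process level, such as an illegal hardware instruction, cannot be caught this way.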

myhloli commented 1 month ago

@Knightlj Maybe your Mac has an Intel CPU. If so, install magic-pdf by following the linked section "On a Mac with an Intel CPU, installing the latest full-feature package magic-pdf[full-cpu] 0.6.x fails".

Knightlj commented 1 month ago

@myhloli My computer reports an Apple M1 Pro chip.

myhloli commented 1 month ago



It looks like you installed the base package. Please try installing the full package:

pip install magic-pdf[full-cpu]
Knightlj commented 1 month ago

@myhloli Not supported.


myhloli commented 1 month ago


Knightlj commented 1 month ago


@myhloli Following the link you just gave, I successfully ran "pip3 install magic-pdf[full-cpu]".

Now a different error is reported: 2024-07-29 16:20:56.744 | ERROR | magic_pdf.model.pp_structure_v2::8 - paddleocr not installed, please install by "pip install magic-pdf[cpu]" or "pip install magic-pdf[gpu]"

myhloli commented 1 month ago


This is not the expected result. Please try:

magic-pdf --version

If your version is 0.5.x, please let us know.
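The version can also be read from inside the Python interpreter that actually runs the tool, which helps when several environments are installed side by side. This is a minimal sketch using the standard importlib.metadata module. The helper names installed_version and is_legacy are hypothetical, and the "0.5" prefix test simply mirrors the 0.5.x versus 0.6.x distinction made in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist="magic-pdf"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None


def is_legacy(ver):
    """Return True when the version string belongs to the 0.5.x series."""
    return ver.split(".")[:2] == ["0", "5"]


if __name__ == "__main__":
    ver = installed_version()
    if ver is None:
        print("magic-pdf is not installed in this environment")
    elif is_legacy(ver):
        print(f"magic-pdf {ver}: still on 0.5.x, the full package was not installed")
    else:
        print(f"magic-pdf {ver}")
```

Running it with the same python binary that pip installed into avoids confusing one environment's result with another's.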

Knightlj commented 1 month ago


magic-pdf, version 0.5.13

myhloli commented 1 month ago

@Knightlj Maybe your Python environment is x86_64. You could switch to an arm64 Python to install magic-pdf.

Knightlj commented 1 month ago

@myhloli The screenshot below shows some information about my computer and Python environment. Could you check whether anything is wrong? 🤦‍♂️

myhloli commented 1 month ago

Yes, the Python platform is x86_64. You should download and install the arm64 build of conda.
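The architecture of the running interpreter can be confirmed from Python itself. On Apple Silicon, an interpreter running under Rosetta reports x86_64 even though the chip is arm64. This is a minimal sketch using only the standard library. The helper names describe_interpreter and is_arm64 are hypothetical.

```python
import platform
import sys


def describe_interpreter():
    """Collect the facts that matter when wheels fail to match the platform."""
    return {
        "machine": platform.machine(),      # e.g. 'arm64' or 'x86_64'
        "system": platform.system(),        # e.g. 'Darwin' or 'Linux'
        "python": platform.python_version(),
        "executable": sys.executable,
    }


def is_arm64(machine=None):
    """Return True if the given (or current) machine string denotes arm64."""
    machine = platform.machine() if machine is None else machine
    return machine.lower() in ("arm64", "aarch64")


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
    if not is_arm64():
        print("This interpreter is not arm64; pip installs wheels for its architecture.")
```

Because pip picks wheels for the interpreter it runs under, an x86_64 Python on an M1 machine ends up with x86_64 binaries regardless of the hardware.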

Knightlj commented 1 month ago

@myhloli I have reinstalled an arm64 python3 and re-ran pip3 install magic-pdf[cpu]. Now a new kind of error is reported: "ImportError: dlopen(/Users/testjam/my_env/lib/python3.12/site-packages/, 0x0002): tried: '/Users/testjam/my_env/lib/python3.12/site-packages/' (mach-o file, but is an incompatible architecture (have (x86_64), need (arm64e)))"

myhloli commented 1 month ago

Please use conda to create a new environment with Python 3.10, then install magic-pdf with:

pip install magic-pdf[full-cpu] 
pip install detectron2 --extra-index-url
Knightlj commented 1 month ago

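@myhloli @cyz2453057960 I gave up tinkering on the Mac. Following cyz's suggestion, I got it working on Ubuntu. Thank you both 🙏. I also noticed that in the generated md file the paragraphs are sometimes not clearly separated, and formulas are not output in LaTeX format, as shown in the screenshot below: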

MiratPH commented 1 month ago



myhloli commented 1 month ago

Please check the installed version:

magic-pdf --version

If the result is not 0.6.1, something may have gone wrong again 😂

Knightlj commented 1 month ago

@myhloli It is not 0.6.1 😭, it is still version 0.5.13. I don't know what to do.

Knightlj commented 1 month ago


myhloli commented 1 month ago

Hmm, arm64 + Linux: many packages do not support this platform. If you only have arm64 hardware, macOS should be your first choice of operating system.

qinzhenlove commented 1 month ago

I have the same problem: an M2 Mac, Python 3.10.14, magic-pdf 0.6.1. Running "pip install magic-pdf[full-cpu]" followed by "pip install detectron2 --extra-index-url" reports "Requirement already satisfied" for every package, but running it still reports the error: Required dependency not installed, please install by "pip install magic-pdf[full-cpu] detectron2 --extra-index-url"

qinzhenlove commented 1 month ago

pip list output (columns: Package, Version):

absl-py 2.1.0 aiohttp 3.9.5 aiosignal 1.3.1 altair 5.3.0 antlr4-python3-runtime 4.9.3 anyio 4.4.0 appnope 0.1.4 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 astor 0.8.1 asttokens 2.4.1 async-lru 2.0.4 async-timeout 4.0.3 attrdict 2.0.1 attrs 23.2.0 Babel 2.15.0 bce-python-sdk 0.9.17 beautifulsoup4 4.12.3 black 24.4.2 bleach 6.1.0 blinker 1.8.2 boto3 1.34.149 botocore 1.34.149 Brotli 1.1.0 cachetools 5.4.0 certifi 2024.7.4 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 colorlog 6.8.2 comm 0.2.2 contourpy 1.2.1 cryptography 43.0.0 cssselect 1.2.0 cssutils 2.11.1 cycler 0.12.1 Cython 3.0.10 datasets 2.20.0 debugpy 1.8.2 decorator 5.1.1 defusedxml 0.7.1 detectron2 0.6 dill 0.3.8 et-xmlfile 1.1.0 evaluate 0.4.2 exceptiongroup 1.2.2 executing 2.0.1 fast-langdetect 0.2.1 fastjsonschema 2.20.0 fasttext-wheel 0.9.2 filelock 3.15.4 fire 0.6.0 Flask 3.0.3 flask-babel 4.0.0 fonttools 4.53.1 fqdn 1.5.1 frozenlist 1.4.1 fsspec 2024.5.0 future 1.0.0 fvcore 0.1.5.post20221221 gitdb 4.0.11 GitPython 3.1.43 grpcio 1.65.1 h11 0.14.0 httpcore 1.0.5 httpx 0.27.0 huggingface-hub 0.24.3 hydra-core 1.3.2 idna 3.7 imageio 2.34.2 imgaug 0.4.0 iopath 0.1.9 ipykernel 6.29.5 ipython 8.26.0 isoduration 20.11.0 itsdangerous 2.2.0 jedi 0.19.1 Jinja2 3.1.4 jmespath 1.0.1 joblib 1.4.2 json5 0.9.25 jsonpointer 3.0.0 jsonschema 4.23.0 jsonschema-specifications 2023.12.1 jupyter_client 8.6.2 jupyter_core 5.7.2 jupyter-events 0.10.0 jupyter-lsp 2.2.5 jupyter_server 2.14.2 jupyter_server_terminals 0.5.3 jupyterlab 4.2.4 jupyterlab_pygments 0.3.0 jupyterlab_server 2.27.3 kiwisolver 1.4.5 lazy_loader 0.4 lmdb 1.5.1 loguru 0.7.2 lxml 5.2.2 magic-pdf 0.6.1 Markdown 3.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.1 matplotlib-inline 0.1.7 mdurl 0.1.2 mistune 3.0.2 more-itertools 10.3.0 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 mypy-extensions 1.0.0 nbclient 0.10.0 nbconvert 7.16.4 nbformat 5.10.4 nest-asyncio 1.6.0 networkx 3.3 nltk 3.8.1 notebook_shim 0.2.4 
numpy 1.26.4 omegaconf 2.3.0 opencv-contrib-python opencv-python opencv-python-headless openpyxl 3.1.5 opt-einsum 3.3.0 overrides 7.7.0 packaging 24.1 paddleocr 2.7.3 paddlepaddle 2.6.1 pandas 2.2.2 pandocfilters 1.5.1 parso 0.8.4 pathspec 0.12.1 pdf2docx 0.5.8 pdf2image 1.17.0 pdfminer.six 20240706 pexpect 4.9.0 pillow 10.4.0 pip 24.0 platformdirs 4.2.2 portalocker 2.10.1 premailer 3.10.0 prometheus_client 0.20.0 prompt_toolkit 3.0.47 protobuf 4.25.4 psutil 6.0.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyarrow 17.0.0 pyarrow-hotfix 0.6 pybind11 2.13.1 pyclipper 1.3.0.post5 pycocotools 2.0.8 pycparser 2.22 pycryptodome 3.20.0 pydeck 0.9.1 Pygments 2.18.0 PyMuPDF 1.24.9 PyMuPDFb 1.24.9 pyparsing 3.1.2 pypdfium2 4.30.0 python-dateutil 2.9.0.post0 python-docx 1.1.2 python-json-logger 2.0.7 pytz 2024.1 PyYAML 6.0.1 pyzmq 26.0.3 rapidfuzz 3.9.4 rarfile 4.2 referencing 0.35.1 regex 2024.7.24 requests 2.32.3 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.1 robust-downloader 0.0.2 rpds-py 0.19.1 s3transfer 0.10.2 scikit-image 0.24.0 scikit-learn 1.5.1 scipy 1.14.0 seaborn 0.13.2 Send2Trash 1.8.3 setuptools 69.5.1 shapely 2.0.5 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 soupsieve 2.5 stack-data 0.6.3 streamlit 1.37.0 streamlit-drawable-canvas 0.9.3 sympy 1.13.1 tabulate 0.9.0 tenacity 8.5.0 tensorboard 2.17.0 tensorboard-data-server 0.7.2 termcolor 2.4.0 terminado 0.18.1 threadpoolctl 3.5.0 tifffile 2024.7.24 tinycss2 1.3.0 toml 0.10.2 tomli 2.0.1 toolz 0.12.1 torch 2.4.0 torchvision 0.19.0 tornado 6.4.1 tqdm 4.66.4 traitlets 5.14.3 types-python-dateutil typing_extensions 4.12.2 tzdata 2024.1 ultralytics 8.2.68 ultralytics-thop 2.0.0 unimernet 0.1.2 uri-template 1.3.0 urllib3 2.2.2 visualdl 2.5.3 wcwidth 0.2.13 webcolors 24.6.0 webencodings 0.5.1 websocket-client 1.8.0 Werkzeug 3.0.3 wheel 0.43.0 wordninja 2.0.0 xxhash 3.4.1 yacs 0.1.8 yarl 1.9.4

Knightlj commented 1 month ago

2024-07-29 22:32:39.654 | WARNING | magic_pdf.cli.magicpdf:get_model_json:310 - not found json /Users/testjam/Desktop/test.json existed
2024-07-29 22:32:39.655 | INFO | magic_pdf.cli.magicpdf:do_parse:91 - local output dir is /tmp/magic-pdf/test/auto
2024-07-29 22:32:42.735 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 79, text_len: 56274, cid_chars_radio: 0.0014167862266857962
zsh: illegal hardware instruction magic-pdf pdf-command --pdf /Users/testjam/Desktop/test.pdf --inside_model

@myhloli Uh, I switched to another Mac, and after some tinkering I now get a new error 😭

myhloli commented 1 month ago

@Knightlj @qinzhenlove @zifengdexiatian We have updated to the 0.6.2b1 release, addressing and resolving the aforementioned issue.