opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.38k stars 853 forks

Execution keeps failing in WSL Ubuntu 22.04 #196

Closed goIntoAction closed 1 month ago

goIntoAction commented 1 month ago

Description of the bug

detectron2 is already installed, but the error persists. Downloading the detectron2 source and building it myself produces the same error.

2024-07-23 18:09:27.579 | ERROR | magic_pdf.model.pdf_extract_kit::24 - Required dependency not installed, please install by

"pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"

How to reproduce the bug

Running magic-pdf pdf-command --pdf reproduces it every time.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cpu

myhloli commented 1 month ago

Then it is very likely not a detectron2 problem. Please upload the output of pip list so we can take a look.
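When the full `pip list` output is long, a narrower check can be quicker to read. The sketch below uses the standard-library `importlib.metadata` to report whether a few distributions are visible to the interpreter that runs it. The names in the example call are assumptions taken from the error message in this thread, not a confirmed list of what `magic-pdf` requires.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Assumed names for illustration; adjust to the packages you care about.
    wanted = ["magic-pdf", "detectron2", "torch"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Run it with the same Python executable that launches `magic-pdf`, since a package installed into a different environment will not show up here.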

goIntoAction commented 1 month ago

Package Version

absl-py 2.1.0
aiohttp 3.9.5
aiosignal 1.3.1
albucore 0.0.12
albumentations 1.4.11
altair 5.3.0
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.4.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
astor 0.8.1
asttokens 2.4.1
async-lru 2.0.4
async-timeout 4.0.3
attrdict 2.0.1
attrs 23.2.0
Babel 2.15.0
bce-python-sdk 0.9.17
beautifulsoup4 4.12.3
black 24.4.2
bleach 6.1.0
blinker 1.8.2
boto3 1.34.146
botocore 1.34.146
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.4.0
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorlog 6.8.2
comm 0.2.2
contourpy 1.2.1
cryptography 43.0.0
cssselect 1.2.0
cssutils 2.11.1
cycler 0.12.1
Cython 3.0.10
datasets 2.20.0
debugpy 1.8.2
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.8
et-xmlfile 1.1.0
eva-decord 0.6.1
eval_type_backport 0.2.0
evaluate 0.4.2
exceptiongroup 1.2.2
executing 2.0.1
fairscale 0.4.13
fast-langdetect 0.2.1
fastjsonschema 2.20.0
fasttext-wheel 0.9.2
filelock 3.15.4
fire 0.6.0
Flask 3.0.3
flask-babel 4.0.0
fonttools 4.53.1
fqdn 1.5.1
frozenlist 1.4.1
fsspec 2024.5.0
ftfy 6.2.0
future 1.0.0
fvcore 0.1.5.post20221221
gitdb 4.0.11
GitPython 3.1.43
grpcio 1.65.1
h11 0.14.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.24.0
hydra-core 1.3.2
idna 3.7
imageio 2.34.2
imgaug 0.4.0
iopath 0.1.9
ipykernel 6.29.5
ipython 8.26.0
isoduration 20.11.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
json5 0.9.25
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
jupyter_client 8.6.2
jupyter_core 5.7.2
jupyter-events 0.10.0
jupyter-lsp 2.2.5
jupyter_server 2.14.2
jupyter_server_terminals 0.5.3
jupyterlab 4.2.4
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
kiwisolver 1.4.5
lazy_loader 0.4
lmdb 1.5.1
loguru 0.7.2
lxml 5.2.2
magic-pdf 0.6.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.9.1
matplotlib-inline 0.1.7
mdurl 0.1.2
mistune 3.0.2
more-itertools 10.3.0
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mypy-extensions 1.0.0
nbclient 0.10.0
nbconvert 7.16.4
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.3
nltk 3.8.1
notebook_shim 0.2.4
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.82
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
opencv-contrib-python 4.6.0.66
opencv-python 4.6.0.66
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opt-einsum 3.3.0
overrides 7.7.0
packaging 24.1
paddleocr 2.7.3
paddlepaddle 2.6.1
pandas 2.2.2
pandocfilters 1.5.1
parso 0.8.4
pathspec 0.12.1
pdf2docx 0.5.8
pdf2image 1.17.0
pdfminer.six 20240706
pexpect 4.9.0
pillow 10.4.0
pip 24.0
platformdirs 4.2.2
portalocker 2.10.1
premailer 3.10.0
prometheus_client 0.20.0
prompt_toolkit 3.0.47
protobuf 4.25.3
psutil 6.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pybind11 2.13.1
pyclipper 1.3.0.post5
pycocotools 2.0.8
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
pydeck 0.9.1
Pygments 2.18.0
PyMuPDF 1.24.8
PyMuPDFb 1.24.8
pyparsing 3.1.2
pypdfium2 4.30.0
python-dateutil 2.9.0.post0
python-docx 1.1.2
python-json-logger 2.0.7
pytz 2024.1
PyYAML 6.0.1
pyzmq 26.0.3
rapidfuzz 3.9.4
rarfile 4.2
referencing 0.35.1
regex 2024.5.15
requests 2.32.3
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.7.1
robust-downloader 0.0.2
rpds-py 0.19.0
s3transfer 0.10.2
safetensors 0.4.3
scikit-image 0.24.0
scikit-learn 1.5.1
scipy 1.14.0
seaborn 0.13.2
Send2Trash 1.8.3
setuptools 69.5.1
shapely 2.0.5
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
soupsieve 2.5
stack-data 0.6.3
streamlit 1.36.0
streamlit-drawable-canvas 0.9.3
sympy 1.13.1
tabulate 0.9.0
tenacity 8.5.0
tensorboard 2.17.0
tensorboard-data-server 0.7.2
termcolor 2.4.0
terminado 0.18.1
threadpoolctl 3.5.0
tifffile 2024.7.21
timm 0.9.16
tinycss2 1.3.0
tokenizers 0.19.1
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.3.1
torchtext 0.18.0
torchvision 0.18.1
tornado 6.4.1
tqdm 4.66.4
traitlets 5.14.3
transformers 4.40.0
triton 2.3.1
types-python-dateutil 2.9.0.20240316
typing_extensions 4.12.2
tzdata 2024.1
ultralytics 8.2.63
ultralytics-thop 2.0.0
unimernet 0.1.1
uri-template 1.3.0
urllib3 2.2.2
visualdl 2.5.3
Wand 0.6.13
watchdog 4.0.1
wcwidth 0.2.13
webcolors 24.6.0
webdataset 0.2.86
webencodings 0.5.1
websocket-client 1.8.0
Werkzeug 3.0.3
wheel 0.43.0
wordninja 2.0.0
xxhash 3.4.1
yacs 0.1.8
yarl 1.9.4

@myhloli could you please take a look?

myhloli commented 1 month ago

The dependency list looks fine. You could try the approach described in this reply: https://github.com/opendatalab/MinerU/issues/165#issuecomment-2245202282

goIntoAction commented 1 month ago

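I'll try running from source later and remove the try/except to see the real error. For now I ran it directly on Windows 11: no error, and the conversion completed. I did notice a small flaw, though: code snippets from the book are laid out incorrectly. There are two cases. Either every character in the code is separated by a space, like f u n f o o ( n : I n t ) : I n t, or everything is squeezed together, like funfoo(n:Int):Int. The snippets are also not wrapped in Markdown code block syntax. The test book was 《Kotlin核心编程》 (Kotlin Core Programming). If this gets solved, converting programming books to Markdown would be perfect, and the output could be fed into RAG.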
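An alternative to editing the library source is to import the suspected dependencies one at a time and print the full traceback for whichever fails. The sketch below does that with the standard library only. The module names in `candidates` are assumptions drawn from the error message and the package list in this thread; which modules `magic_pdf.model.pdf_extract_kit` actually imports is not confirmed here.

```python
import importlib
import traceback


def probe_imports(module_names):
    """Try to import each module; return {name: None or formatted traceback}."""
    report = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            report[name] = None
        except Exception:
            # Not only ImportError: native extensions can raise OSError and others.
            report[name] = traceback.format_exc()
    return report


if __name__ == "__main__":
    # Hypothetical candidates for illustration only.
    candidates = ["detectron2", "torch", "paddleocr", "unimernet", "ultralytics"]
    for name, err in probe_imports(candidates).items():
        print(f"[{'ok' if err is None else 'FAILED'}] {name}")
        if err is not None:
            print(err)
```

The traceback of the first failing module usually names the underlying cause (a missing shared library, a version mismatch, and so on), which a generic "Required dependency not installed" message hides.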

myhloli commented 1 month ago

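Code blocks cannot currently be recognized. What you describe is most likely a code block being recognized as a text block. Another possibility is that it gets recognized as a table block, in which case it is output as a screenshot. We will improve recognition later and handle code blocks separately.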
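Until code blocks are handled natively, the character-spaced case can at least be flagged in the generated Markdown. The following is a minimal heuristic sketch, not part of MinerU: the function names and thresholds are hypothetical. It treats a line as character-spaced when nearly all whitespace-separated tokens are single characters, and joins them.

```python
def looks_char_spaced(line, min_tokens=6, ratio=0.9):
    """Return True if almost every whitespace-separated token is one character."""
    tokens = line.split()
    if len(tokens) < min_tokens:
        return False
    singles = sum(1 for t in tokens if len(t) == 1)
    return singles / len(tokens) >= ratio


def collapse_char_spacing(line):
    """Join the characters of a character-spaced line; leave other lines alone."""
    if not looks_char_spaced(line):
        return line
    return "".join(line.split())


if __name__ == "__main__":
    print(collapse_char_spacing("f u n f o o ( n : I n t ) : I n t"))
    print(collapse_char_spacing("This is an ordinary sentence with normal words."))
```

Note the limitation: the original word boundaries are lost in extraction, so collapsing yields the squeezed form (`funfoo(n:Int):Int`), not `fun foo(n: Int): Int`. The detector is therefore mainly useful for locating affected lines for manual repair.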