opendatalab / MinerU

A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
17.68k stars 1.28k forks

GPU runtime issue in version 0.7.1 #659

Closed James-Dao closed 1 month ago

James-Dao commented 1 month ago

Description of the bug

cat /root/magic-pdf.json
{
  "bucket_info": {
    "bucket-name-1": ["ak", "sk", "endpoint"],
    "bucket-name-2": ["ak", "sk", "endpoint"]
  },
  "models-dir": "/root/mineru/models",
  "device-mode": "cuda",
  "table-config": {
    "model": "TableMaster",
    "is_table_recog_enable": true,
    "max_time": 400
  }
}
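Since the report hinges on two settings in this file, "device-mode" and "table-config", a small script can confirm what a given config actually contains before a run. This is only a sketch: `summarize_config` is a hypothetical helper written for illustration and is not part of magic-pdf, and the sample values are copied from the config shown in the report.

```python
import json
import tempfile
from pathlib import Path


def summarize_config(path):
    """Return the settings from a magic-pdf.json file that matter for this report."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    table = config.get("table-config", {})
    return {
        "device_mode": config.get("device-mode"),
        "table_model": table.get("model"),
        "table_enabled": table.get("is_table_recog_enable", False),
    }


# Demonstration on a temporary file holding the relevant keys from the report.
sample = {
    "models-dir": "/root/mineru/models",
    "device-mode": "cuda",
    "table-config": {
        "model": "TableMaster",
        "is_table_recog_enable": True,
        "max_time": 400,
    },
}
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "magic-pdf.json"
    config_path.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_config(config_path)
print(summary)
```

To inspect a real installation, pass the path of the actual config file (here /root/magic-pdf.json) to the helper instead of the temporary one.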

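As long as the early pages of the PDF contain no tables, parsing works normally and the speed is good. As soon as a table is encountered, the error shown below is raised. The setting "is_table_recog_enable": true is configured the same way on CPU, where it works; the error only appears in the GPU environment.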

0: 1888x1472 (no detections), 12.7ms
Speed: 19.3ms preprocess, 12.7ms inference, 0.6ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-09 06:05:04.842 | INFO | magic_pdf.model.pdf_extract_kit:__call__:200 - formula nums: 0, mfr time: 0.0
2024-09-09 06:05:04.848 | INFO | magic_pdf.model.pdf_extract_kit:__call__:317 - table cost: 0.0
2024-09-09 06:05:05.123 | INFO | magic_pdf.model.pdf_extract_kit:__call__:170 - layout detection cost: 0.27

0: 1888x1472 (no detections), 12.6ms
Speed: 21.4ms preprocess, 12.6ms inference, 0.6ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-09 06:05:05.160 | INFO | magic_pdf.model.pdf_extract_kit:__call__:200 - formula nums: 0, mfr time: 0.0
2024-09-09 06:05:05.168 | INFO | magic_pdf.model.pdf_extract_kit:__call__:291 - ------------------table recognition processing begins-----------------
2024-09-09 06:05:05.565 | ERROR | magic_pdf.tools.cli:parse_doc:69 - axis 2 is out of bounds for array of dimension 1
Traceback (most recent call last):

  File "/usr/local/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
      └ ctx.params: {'path': '/root/mineru/testdata/summary.pdf', 'output_dir': '', 'method': 'auto'}
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 75, in cli
    parse_doc(path)
      └ path: '/root/mineru/testdata/summary.pdf'
  File "/usr/local/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 60, in parse_doc
    do_parse(
  File "/usr/local/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 65, in do_parse
    pipe.pipe_analyze()
  File "/usr/local/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 29, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=False)
  File "/usr/local/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 119, in doc_analyze
    result = custom_model(img)
  File "/usr/local/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 298, in __call__
    html_code = self.table_model.img2html(new_image)
      └ new_image: <PIL.Image.Image image mode=RGB size=1393x550 at 0x7F8DBD7A64D0>
  File "/usr/local/lib/python3.10/site-packages/magic_pdf/model/ppTableModel.py", line 40, in img2html
    pred_res, _ = self.table_sys(image)
  File "/usr/local/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 86, in __call__
    structure_res, elapse = self._structure(copy.deepcopy(img))
  File "/usr/local/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 109, in _structure
    structure_res, elapse = self.table_structurer(copy.deepcopy(img))
  File "/usr/local/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_structure.py", line 147, in __call__
    post_result = self.postprocess_op(preds, [shape_list])
      └ preds: {'structure_probs': array([], dtype=float32), 'loc_preds': array([], dtype=float32)}
      └ shape_list: array([[ 550, 1393, 0.34458, 0.34458, 480, 480]])
  File "/usr/local/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/table_postprocess.py", line 56, in __call__
    result = self.decode(structure_probs, bbox_preds, shape_list)
  File "/usr/local/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/table_postprocess.py", line 69, in decode
    structure_idx = structure_probs.argmax(axis=2)
      └ structure_probs: array([], dtype=float32)

numpy.exceptions.AxisError: axis 2 is out of bounds for array of dimension 1
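The last frame shows why the exception has this particular message: the table model returned an empty one-dimensional array for structure_probs, and argmax(axis=2) needs at least three dimensions. The sketch below reproduces that with plain NumPy. `safe_structure_argmax` is a hypothetical guard added for illustration only; it is not code from PaddleOCR or magic-pdf, and it would only hide the symptom, since the empty model output is the real problem.

```python
import numpy as np

# Same value that the traceback shows for structure_probs.
structure_probs = np.array([], dtype=np.float32)

try:
    structure_probs.argmax(axis=2)
except ValueError as exc:  # AxisError is a subclass of ValueError
    error_message = str(exc)
    print(type(exc).__name__, error_message)


def safe_structure_argmax(probs):
    """Return argmax over axis 2, or None when the model produced no usable output."""
    probs = np.asarray(probs)
    if probs.ndim < 3 or probs.size == 0:
        return None
    return probs.argmax(axis=2)


print(safe_structure_argmax(structure_probs))
print(safe_structure_argmax(np.zeros((1, 4, 5), dtype=np.float32)).shape)
```

An empty prediction on GPU, where the same config works on CPU, points at the inference step itself rather than at the post-processing code that raises.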

How to reproduce the bug

Enable CUDA and table recognition, then run magic-pdf in auto mode.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

James-Dao commented 1 month ago

If "is_table_recog_enable" is set to false, that is, with table recognition turned off, the error does not occur.

James-Dao commented 1 month ago

magic-pdf -p /root/mineru/testdata/summary.pdf -m ocr
2024-09-07 15:13:29.784 | INFO | magic_pdf.model.pdf_extract_kit:__init__:121 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: True, apply_table: True
2024-09-07 15:13:29.784 | INFO | magic_pdf.model.pdf_extract_kit:__init__:129 - using device: cuda
2024-09-07 15:13:29.784 | INFO | magic_pdf.model.pdf_extract_kit:__init__:131 - using models_dir: /root/mineru/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[09/07 15:13:57 detectron2]: Rank of current process: 0. World size: 1
[09/07 15:13:58 detectron2]: Environment info:

sys.platform                     linux
Python                           3.10.14 (main, Sep 5 2024, 00:26:22) [GCC 12.2.0]
numpy                            1.26.4
detectron2                       0.6 @/usr/local/lib/python3.10/site-packages/detectron2
Compiler                         GCC 11.4
CUDA compiler                    not available
DETECTRON2_ENV_MODULE
PyTorch                          2.3.1+cu121 @/usr/local/lib/python3.10/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA XXX 80GB HBM3 (arch=9.0)
Driver version                   535.104.12
CUDA_HOME                        None - invalid!
Pillow                           10.4.0
torchvision                      0.18.1+cu121 @/usr/local/lib/python3.10/site-packages/torchvision
torchvision arch flags           /usr/local/lib/python3.10/site-packages/torchvision/_C.so
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.6.0

PyTorch built with:

GCC 9.3
C++ Version: 201703
Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX512
CUDA Runtime 12.1
NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
CuDNN 8.9.2
Magma 2.6.1

myhloli commented 1 month ago

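> If "is_table_recog_enable" is set to false, that is, with table recognition turned off, the error does not occur.

My understanding is that the problem appears whenever paddle is invoked with GPU acceleration. Even with table recognition disabled, running with -m ocr on the command line would fail as well. The core issue is that paddle's GPU library is incompatible with your graphics card. Try a fresh, clean environment with only paddleocr and paddlepaddle-gpu installed and check whether it runs correctly.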
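The clean-environment check suggested here could be scripted roughly as follows. This is a sketch under assumptions: it relies on Paddle's own self-test, `paddle.utils.run_check()`, plus `paddle.device.is_compiled_with_cuda()` and `paddle.device.get_device()`, and the wrapper `check_paddle_gpu` is a hypothetical helper that just turns the outcome into a one-line status. Nothing here is specific to magic-pdf.

```python
import importlib
import importlib.util


def check_paddle_gpu():
    """Return a one-line status describing whether paddle can run on the GPU."""
    if importlib.util.find_spec("paddle") is None:
        return "paddle is not installed in this environment"
    try:
        paddle = importlib.import_module("paddle")
    except Exception as exc:  # a broken install can fail at import time
        return f"paddle failed to import: {exc}"
    if not paddle.device.is_compiled_with_cuda():
        return f"paddle {paddle.__version__} is a CPU-only build"
    try:
        # run_check() executes a small computation and prints its own diagnostics.
        paddle.utils.run_check()
    except Exception as exc:
        return f"paddle {paddle.__version__} self-check failed: {exc}"
    return f"paddle {paddle.__version__} self-check passed on {paddle.device.get_device()}"


status = check_paddle_gpu()
print(status)
```

If the self-check already fails in an environment that contains only paddleocr and paddlepaddle-gpu, the problem lies in the Paddle/CUDA/driver combination rather than in magic-pdf.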

James-Dao commented 1 month ago

OK. What I did before was install three packages (paddleocr, paddlepaddle, and paddlepaddle-gpu), then uninstall paddlepaddle and install paddlepaddle-gpu again.

James-Dao commented 1 month ago

However, I noticed that installing magic-pdf installs paddlepaddle by default.
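Because the CPU build (paddlepaddle) can be pulled in alongside the GPU build (paddlepaddle-gpu), it helps to see which distributions are actually present. The sketch below uses only the standard library. `installed_paddle_builds` is a hypothetical helper, and treating coexistence as suspicious is an assumption based on both distributions providing the same `paddle` import package; the thread does not confirm it as the cause of the error.

```python
from importlib import metadata

# Distribution names as they appear in the pip commands in this thread.
PADDLE_DISTRIBUTIONS = ("paddlepaddle", "paddlepaddle-gpu")


def installed_paddle_builds(version_of=metadata.version):
    """Map each installed paddle distribution to its version string."""
    found = {}
    for name in PADDLE_DISTRIBUTIONS:
        try:
            found[name] = version_of(name)
        except metadata.PackageNotFoundError:
            continue
    return found


builds = installed_paddle_builds()
if len(builds) > 1:
    print("Both CPU and GPU builds are installed:", builds)
else:
    print("Installed paddle builds:", builds)
```

The `version_of` parameter exists only so the lookup can be swapped out, for example when checking the logic without touching the real environment.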

James-Dao commented 1 month ago

Previously I installed it with pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

That needs to be changed to pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

The CUDA version on my side is somewhat newer.

myhloli commented 1 month ago

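> Previously I installed it with pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
>
> That needs to be changed to pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
>
> The CUDA version on my side is somewhat newer.

Paddle 3.0 uses its own bundled CUDA environment. We generally recommend the 11.8 build to avoid conflicts with torch's CUDA 12.1.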

James-Dao commented 1 month ago

It would be nice to have an image on Docker Hub.

myhloli commented 1 month ago

> It would be nice to have an image on Docker Hub.

We provide a Dockerfile; you can build the image yourself.

James-Dao commented 1 month ago

I'll give your Dockerfile a try. Does your image work on both CPU and GPU?

myhloli commented 1 month ago

> I'll give your Dockerfile a try. Does your image work on both CPU and GPU?

The image needs to run on a GPU device.

James-Dao commented 1 month ago

I built an image from your Dockerfile. After starting it, the magic-pdf command is not even available.

root@mineru-gpu-59fccd86fd-tc8dq:/# ls
bin boot dev download_models.py etc home lib lib32 lib64 libx32 magic-pdf.template.json media mnt opt proc requirements-docker.txt root run sbin srv sys tmp usr var
root@mineru-gpu-59fccd86fd-tc8dq:/# magic-pdf --help
bash: magic-pdf: command not found
root@mineru-gpu-59fccd86fd-tc8dq:/#

James-Dao commented 1 month ago

I'm not sure whether I'm using it the wrong way. The image build itself went fairly smoothly.

myhloli commented 1 month ago

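> I'm not sure whether I'm using it the wrong way. The image build itself went fairly smoothly.

Run pip list inside the Docker container and paste the result here.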

James-Dao commented 1 month ago

root@mineru-gpu-59fccd86fd-tc8dq:/# pip list
Package Version
blinker 1.4
certifi 2024.8.30
charset-normalizer 3.3.2
cryptography 3.4.8
dbus-python 1.2.18
distro 1.7.0
distro-info 1.1+ubuntu0.2
httplib2 0.20.2
idna 3.10
importlib-metadata 4.6.4
jeepney 0.7.1
keyring 23.5.0
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
modelscope 1.18.1
more-itertools 8.10.0
oauthlib 3.2.0
pip 22.0.2
PyGObject 3.42.1
PyJWT 2.3.0
pyparsing 2.4.7
python-apt 2.4.0+ubuntu4
requests 2.32.3
SecretStorage 3.3.1
setuptools 59.6.0
six 1.16.0
tqdm 4.66.5
unattended-upgrades 0.1
urllib3 2.2.3
wadllib 1.3.6
wheel 0.37.1
zipp 1.0.0

myhloli commented 1 month ago

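Got it. The packages were most likely installed into a virtual environment, but you entered the container without going through the ENTRYPOINT, so the virtual environment was never activated. Inside the container, run source /opt/mineru_venv/bin/activate to enter the virtual environment.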
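Whether a shell is using the virtual environment can also be checked from Python itself. The sketch below uses only the standard library: inside a venv, `sys.prefix` differs from `sys.base_prefix`, and `shutil.which` reports where a command would be found on PATH. `describe_environment` is a hypothetical helper for illustration, and the path /opt/mineru_venv comes from the comment above.

```python
import shutil
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def describe_environment(command="magic-pdf"):
    """Collect the facts that explain a 'command not found' inside the container."""
    return {
        "in_virtualenv": in_virtualenv(),
        "python_prefix": sys.prefix,
        "command_path": shutil.which(command),  # None when the command is not on PATH
    }


info = describe_environment()
print(info)
```

Before activation one would expect `command_path` to be None for magic-pdf. After running source /opt/mineru_venv/bin/activate it should point into /opt/mineru_venv/bin, matching the traceback paths later in the thread.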

James-Dao commented 1 month ago

(mineru_venv) root@mineru-gpu-59fccd86fd-tc8dq:/# pip list
Package Version
absl-py 2.1.0
aiohappyeyeballs 2.4.0
aiohttp 3.10.6
aiosignal 1.3.1
albucore 0.0.17
albumentations 1.4.16
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.6.0
astor 0.8.1
async-timeout 4.0.3
attrdict 2.0.1
attrs 24.2.0
babel 2.16.0
bce-python-sdk 0.9.22
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
boto3 1.35.26
botocore 1.35.26
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.5.0
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorlog 6.8.2
contourpy 1.3.0
cryptography 43.0.1
cssselect 1.2.0
cssutils 2.11.1
cycler 0.12.1
Cython 3.0.11
datasets 3.0.0
decorator 5.1.1
detectron2 0.6
dill 0.3.8
et-xmlfile 1.1.0
eva-decord 0.6.1
eval_type_backport 0.2.0
evaluate 0.4.3
exceptiongroup 1.2.2
fairscale 0.4.13
fast-langdetect 0.2.0
fasttext-wheel 0.9.2
filelock 3.16.1
fire 0.6.0
Flask 3.0.3
flask-babel 4.0.0
fonttools 4.54.1
frozenlist 1.4.1
fsspec 2024.6.1
ftfy 6.2.3
future 1.0.0
fvcore 0.1.5.post20221221
grpcio 1.66.1
h11 0.14.0
httpcore 1.0.5
httpx 0.27.2
huggingface-hub 0.25.1
hydra-core 1.3.2
idna 3.10
imageio 2.35.1
imgaug 0.4.0
iopath 0.1.9
itsdangerous 2.2.0
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
kiwisolver 1.4.7
langdetect 1.0.9
lazy_loader 0.4
lmdb 1.5.1
loguru 0.7.2
lxml 5.3.0
magic-pdf 0.8.1
Markdown 3.7
MarkupSafe 2.1.5
matplotlib 3.9.2
more-itertools 10.5.0
mpmath 1.3.0
multidict 6.1.0
multiprocess 0.70.16
mypy-extensions 1.0.0
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu11 11.11.3.6
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.7.0.84
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.3.0.86
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.5.86
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.19.3
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.68
nvidia-nvtx-cu11 11.8.86
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
opencv-contrib-python 4.6.0.66
opencv-python 4.6.0.66
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opt-einsum 3.3.0
packaging 24.1
paddleocr 2.7.3
paddlepaddle 3.0.0b1
paddlepaddle-gpu 3.0.0b1
pandas 2.2.3
pathspec 0.12.1
pdf2docx 0.5.8
pdfminer.six 20231228
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
portalocker 2.10.1
premailer 3.10.0
protobuf 5.28.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 17.0.0
pybind11 2.13.6
pyclipper 1.3.0.post5
pycocotools 2.0.8
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.7.4
pydantic_core 2.18.4
PyMuPDF 1.24.10
PyMuPDFb 1.24.10
pypandoc 1.13
pyparsing 3.1.4
python-dateutil 2.9.0.post0
python-docx 1.1.2
pytz 2024.2
PyYAML 6.0.2
RapidFuzz 3.10.0
rarfile 4.2
regex 2024.9.11
requests 2.32.3
robust-downloader 0.0.2
s3transfer 0.10.2
safetensors 0.4.5
scikit-image 0.24.0
scikit-learn 1.5.2
scipy 1.14.1
seaborn 0.13.2
setuptools 59.6.0
shapely 2.0.6
six 1.16.0
sniffio 1.3.1
soupsieve 2.6
struct-eqtable 0.1.0
sympy 1.13.3
tabulate 0.9.0
tensorboard 2.17.1
tensorboard-data-server 0.7.2
termcolor 2.4.0
threadpoolctl 3.5.0
tifffile 2024.9.20
timm 0.9.16
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.1
torchtext 0.18.0
torchvision 0.18.1
tqdm 4.66.5
transformers 4.40.0
triton 2.3.1
typing_extensions 4.12.2
tzdata 2024.2
ultralytics 8.2.100
ultralytics-thop 2.0.8
unimernet 0.1.6
urllib3 2.2.3
visualdl 2.5.3
Wand 0.6.13
wcwidth 0.2.13
webdataset 0.2.100
Werkzeug 3.0.4
wordninja 2.0.0
xxhash 3.5.0
yacs 0.1.8
yarl 1.12.1

myhloli commented 1 month ago

This environment is correct now; you can use MinerU normally in it.

James-Dao commented 1 month ago

magic-pdf -p /root/data/haiguan-12-19.pdf -o /root/data/
2024-09-26 03:31:28.304 INFO magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 13040, cid_chars_radio: 0.0
Creating new Ultralytics Settings v0.0.6 file ✅
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
2024-09-26 03:31:40.734 INFO magic_pdf.model.pdf_extract_kit:__init__:180 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: False, apply_table: True
2024-09-26 03:31:40.734 INFO magic_pdf.model.pdf_extract_kit:__init__:188 - using device: cuda
2024-09-26 03:31:40.734 INFO magic_pdf.model.pdf_extract_kit:__init__:190 - using models_dir: /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[09/26 03:32:01 detectron2]: Rank of current process: 0. World size: 1
[09/26 03:32:02 detectron2]: Environment info:

sys.platform                     linux
Python                           3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
numpy                            1.26.4
detectron2                       0.6 @/opt/mineru_venv/lib/python3.10/site-packages/detectron2
Compiler                         GCC 11.4
CUDA compiler                    not available
DETECTRON2_ENV_MODULE
PyTorch                          2.3.1+cu121 @/opt/mineru_venv/lib/python3.10/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA xxx 80GB HBM3 (arch=9.0)
Driver version                   535.104.12
CUDA_HOME                        None - invalid!
Pillow                           10.4.0
torchvision                      0.18.1+cu121 @/opt/mineru_venv/lib/python3.10/site-packages/torchvision
torchvision arch flags           /opt/mineru_venv/lib/python3.10/site-packages/torchvision/_C.so
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.6.0

PyTorch built with:

[09/26 03:32:02 detectron2]: Command line arguments: {'config_file': '/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth']}
[09/26 03:32:02 detectron2]: Contents of args.config_file=/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:
AUG:
  DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: false
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[09/26 03:32:04 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ...
[09/26 03:32:04 fvcore.common.checkpoint]: [Checkpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ...
2024-09-26 03:32:11.768 | INFO | magic_pdf.model.pdf_extract_kit:__init__:248 - DocAnalysis init done!
2024-09-26 03:32:11.769 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 43.46337127685547
2024-09-26 03:32:13.837 | INFO | magic_pdf.model.pdf_extract_kit:__call__:259 - layout detection cost: 1.59

0: 1888x1344 (no detections), 149.0ms
Speed: 21.3ms preprocess, 149.0ms inference, 1.2ms postprocess per image at shape (1, 3, 1888, 1344)
2024-09-26 03:32:14.758 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 0, mfr time: 0.0
2024-09-26 03:32:14.769 | INFO | magic_pdf.model.pdf_extract_kit:__call__:407 - table cost: 0.0
2024-09-26 03:32:15.014 | INFO | magic_pdf.model.pdf_extract_kit:__call__:259 - layout detection cost: 0.24

0: 1888x1344 4 embeddings, 12.6ms
Speed: 17.2ms preprocess, 12.6ms inference, 1.7ms postprocess per image at shape (1, 3, 1888, 1344)
2024-09-26 03:32:16.409 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 4, mfr time: 1.33
2024-09-26 03:32:16.432 | INFO | magic_pdf.model.pdf_extract_kit:__call__:380 - ------------------table recognition processing begins-----------------
2024-09-26 03:32:16.654 | ERROR | magic_pdf.tools.cli:parse_doc:96 - axis 2 is out of bounds for array of dimension 1
Traceback (most recent call last):

  File "/opt/mineru_venv/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
      └ ctx.params: {'path': '/root/data/haiguan-12-19.pdf', 'output_dir': '/root/data/', 'method': 'auto', 'debug_able': False, 'start_page_id':...
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 102, in cli
    parse_doc(path)
      └ path: '/root/data/haiguan-12-19.pdf'
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 84, in parse_doc
    do_parse(
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 79, in do_parse
    pipe.pipe_analyze()
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 30, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=False,
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 129, in doc_analyze
    result = custom_model(img)
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 387, in __call__
    html_code = self.table_model.img2html(new_image)
      └ new_image: <PIL.Image.Image image mode=RGB size=1245x1622 at 0x7F0C637BE620>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/ppTableModel.py", line 40, in img2html
    pred_res, _ = self.table_sys(image)
  File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 86, in __call__
    structure_res, elapse = self._structure(copy.deepcopy(img))
  File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 109, in _structure
    structure_res, elapse = self.table_structurer(copy.deepcopy(img))
  File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_structure.py", line 147, in __call__
    post_result = self.postprocess_op(preds, [shape_list])
      └ preds: {'structure_probs': array([], dtype=float32), 'loc_preds': array([], dtype=float32)}
      └ shape_list: array([[ 1622, 1245, 0.29593, 0.29593, 480, 480]])
  File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/table_postprocess.py", line 56, in __call__
    result = self.decode(structure_probs, bbox_preds, shape_list)
  File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/table_postprocess.py", line 69, in decode
    structure_idx = structure_probs.argmax(axis=2)
      └ structure_probs: array([], dtype=float32)

numpy.exceptions.AxisError: axis 2 is out of bounds for array of dimension 1

James-Dao commented 1 month ago

This is the same problem as last time.

James-Dao commented 1 month ago

With CUDA, it fails as soon as it hits a table.

myhloli commented 1 month ago

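> With CUDA, it fails as soon as it hits a table.

This looks like a problem with one particular table. You can download https://github.com/opendatalab/MinerU/blob/master/demo/small_ocr.pdf from the repository and parse it to test whether the paddle GPU component works properly. If small_ocr parses correctly, the environment is OK.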

James-Dao commented 1 month ago

OK. This is a k8s environment. The GPU applications running on it, vLLM and Ray, are all deployed the same way, and they all work fine.

James-Dao commented 1 month ago

magic-pdf -p /root/data/small_ocr.pdf -o /root/data/
2024-09-26 03:48:21.076 INFO magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 8, cid_chars_radio: 0.0
2024-09-26 03:48:21.084 WARNING magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: False, by_invalid_chars: True
2024-09-26 03:48:33.188 INFO magic_pdf.model.pdf_extract_kit:__init__:180 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: True, apply_table: True
2024-09-26 03:48:33.189 INFO magic_pdf.model.pdf_extract_kit:__init__:188 - using device: cuda
2024-09-26 03:48:33.189 INFO magic_pdf.model.pdf_extract_kit:__init__:190 - using models_dir: /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[09/26 03:48:51 detectron2]: Rank of current process: 0. World size: 1
[09/26 03:48:52 detectron2]: Environment info:

sys.platform                     linux
Python                           3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
numpy                            1.26.4
detectron2                       0.6 @/opt/mineru_venv/lib/python3.10/site-packages/detectron2
Compiler                         GCC 11.4
CUDA compiler                    not available
DETECTRON2_ENV_MODULE
PyTorch                          2.3.1+cu121 @/opt/mineru_venv/lib/python3.10/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA H100 80GB HBM3 (arch=9.0)
Driver version                   535.104.12
CUDA_HOME                        None - invalid!
Pillow                           10.4.0
torchvision                      0.18.1+cu121 @/opt/mineru_venv/lib/python3.10/site-packages/torchvision
torchvision arch flags           /opt/mineru_venv/lib/python3.10/site-packages/torchvision/_C.so
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.6.0

PyTorch built with:

[09/26 03:48:52 detectron2]: Command line arguments: {'config_file': '/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth']}
[09/26 03:48:52 detectron2]: Contents of args.config_file=/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:
AUG:
  DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: false
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[09/26 03:48:54 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ...
[09/26 03:48:54 fvcore.common.checkpoint]: [Checkpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ...
download https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar to /root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer/ch_PP-OCRv4_det_infer.tar
2024-09-26 03:48:58.930 | ERROR | magic_pdf.tools.cli:parse_doc:96 - HTTPSConnectionPool(host='paddleocr.bj.bcebos.com', port=443): Max retries exceeded with url: /PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbef0c66ec0>: Failed to resolve 'paddleocr.bj.bcebos.com' ([Errno -3] Temporary failure in name resolution)"))
Traceback (most recent call last):

  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
    sock = connection.create_connection(
  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
      └ host: 'paddleocr.bj.bcebos.com'
      └ port: 443
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):

socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
    raise new_e
  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
    self.sock = sock = self._new_conn()
  File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connection.py", line 206, in _new_conn
    raise NameResolutionError(self.host, self, e) from e

urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7fbef0c66ec0>: Failed to resolve 'paddleocr.bj.bcebos.com' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/opt/mineru_venv/lib/python3.10/site-packages/requests/adapters.py", line 667, in send resp = conn.urlopen( │ └ <function HTTPConnectionPool.urlopen at 0x7fc4605c4820> └ <urllib3.connectionpool.HTTPSConnectionPool object at 0x7fbef0c67b50> File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen retries = retries.increment( │ └ <function Retry.increment at 0x7fc460698040> └ Retry(total=0, connect=None, read=False, redirect=None, status=None) File "/opt/mineru_venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type] │ │ │ │ └ NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbef0c66ec0>: Failed to resolve 'paddleocr.bj.bcebos.co... │ │ │ └ NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbef0c66ec0>: Failed to resolve 'paddleocr.bj.bcebos.co... │ │ └ '/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar' │ └ <urllib3.connectionpool.HTTPSConnectionPool object at 0x7fbef0c67b50> └ <class 'urllib3.exceptions.MaxRetryError'>

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='paddleocr.bj.bcebos.com', port=443): Max retries exceeded with url: /PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbef0c66ec0>: Failed to resolve 'paddleocr.bj.bcebos.com' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/opt/mineru_venv/bin/magic-pdf", line 8, in sys.exit(cli()) │ │ └ │ └ └ <module 'sys' (built-in)> File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1157, in call return self.main(args, kwargs) │ │ │ └ {} │ │ └ () │ └ <function BaseCommand.main at 0x7fc46116d3f0> └ File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) │ │ └ <click.core.Context object at 0x7fc4613781f0> │ └ <function Command.invoke at 0x7fc46116dea0> └ File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, ctx.params) │ │ │ │ │ └ {'path': '/root/data/small_ocr.pdf', 'output_dir': '/root/data/', 'method': 'auto', 'debug_able': False, 'start_page_id': 0, ... │ │ │ │ └ <click.core.Context object at 0x7fc4613781f0> │ │ │ └ <function cli at 0x7fc30694d090> │ │ └ │ └ <function Context.invoke at 0x7fc46116cc10> └ <click.core.Context object at 0x7fc4613781f0> File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke return __callback(args, **kwargs) │ └ {'path': '/root/data/small_ocr.pdf', 'output_dir': '/root/data/', 'method': 'auto', 'debug_able': False, 'start_page_id': 0, ... └ () File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 102, in cli parse_doc(path) │ └ '/root/data/small_ocr.pdf' └ <function cli..parse_doc at 0x7fc461395750>

File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 84, in parse_doc do_parse( └ <function do_parse at 0x7fc45cabfd90> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 79, in do_parse pipe.pipe_analyze() │ └ <function UNIPipe.pipe_analyze at 0x7fc30694c9d0> └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fc3069313c0> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 33, in pipe_analyze self.model_list = doc_analyze(self.pdf_bytes, ocr=True, │ │ │ │ └ b'%PDF-1.7\r\n%\xa1\xb3\xc5\xd7\r\n1 0 obj\r\n<</Pages 2 0 R /Type/Catalog>>\r\nendobj\r\n2 0 obj\r\n<</Count 8/Kids[ 4 0 R ... │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fc3069313c0> │ │ └ <function doc_analyze at 0x7fc3bbb7a320> │ └ [] └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fc3069313c0> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 110, in doc_analyze custom_model = model_manager.get_model(ocr, show_log) │ │ │ └ False │ │ └ True │ └ <function ModelSingleton.get_model at 0x7fc3bbb7a290> └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x7fc306410b20> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 63, in get_model self._models[key] = custom_model_init(ocr=ocr, show_log=show_log) │ │ │ │ │ └ False │ │ │ │ └ True │ │ │ └ <function custom_model_init at 0x7fc3bbb7a170> │ │ └ (True, False) │ └ {} └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x7fc306410b20> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 93, in custom_model_init custom_model = CustomPEKModel(model_input) │ └ {'ocr': True, 'show_log': False, 'models_dir': '/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models', 'device': 'c... 
└ <class 'magic_pdf.model.pdf_extract_kit.CustomPEKModel'> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 230, in init self.ocr_model = atom_model_manager.get_atom_model( │ │ └ <function AtomModelSingleton.get_atom_model at 0x7fbf381443a0> │ └ <magic_pdf.model.pdf_extract_kit.AtomModelSingleton object at 0x7fbf382c7e20> └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7fc306410f40> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 112, in get_atom_model self._models[atom_model_name] = atom_model_init(model_name=atom_model_name, kwargs) │ │ │ │ │ └ {'ocr_show_log': False, 'det_db_box_thresh': 0.3} │ │ │ │ └ 'ocr' │ │ │ └ <function atom_model_init at 0x7fbf381440d0> │ │ └ 'ocr' │ └ {'mfd': YOLO( │ (model): DetectionModel( │ (model): Sequential( │ (0): Conv( │ (conv): Conv2d(3, 64, kernel_size=... └ <magic_pdf.model.pdf_extract_kit.AtomModelSingleton object at 0x7fbf382c7e20> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 135, in atom_model_init atom_model = ocr_model_init( └ <function ocr_model_init at 0x7fbf38144040> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 78, in ocr_model_init model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh) │ │ └ 0.3 │ └ False └ <class 'magic_pdf.model.pek_sub_modules.self_modify.ModifiedPaddleOCR'> File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/paddleocr.py", line 599, in init maybe_download(params.det_model_dir, det_url) │ │ │ └ 'https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar' │ │ └ '/root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer' │ └ Namespace(help='==SUPPRESS==', use_gpu=True, use_xpu=False, use_npu=False, ir_optim=True, use_tensorrt=False, min_subgraph_si... 
└ <function maybe_download at 0x7fbf382ce830> File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppocr/utils/network.py", line 55, in maybe_download download_with_progressbar(url, tmp_path) │ │ └ '/root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer/ch_PP-OCRv4_det_infer.tar' │ └ 'https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar' └ <function download_with_progressbar at 0x7fbf382ce7a0> File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppocr/utils/network.py", line 28, in download_with_progressbar response = requests.get(url, stream=True) │ │ └ 'https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar' │ └ <function get at 0x7fc3ba82c670> └ <module 'requests' from '/opt/mineru_venv/lib/python3.10/site-packages/requests/init.py'> File "/opt/mineru_venv/lib/python3.10/site-packages/requests/api.py", line 73, in get return request("get", url, params=params, kwargs) │ │ │ └ {'stream': True} │ │ └ None │ └ 'https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar' └ <function request at 0x7fc3ba7ea440> File "/opt/mineru_venv/lib/python3.10/site-packages/requests/api.py", line 59, in request return session.request(method=method, url=url, kwargs) │ │ │ │ └ {'params': None, 'stream': True} │ │ │ └ 'https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar' │ │ └ 'get' │ └ <function Session.request at 0x7fc3ba80fd90> └ <requests.sessions.Session object at 0x7fbef0bc54e0> File "/opt/mineru_venv/lib/python3.10/site-packages/requests/sessions.py", line 589, in request resp = self.send(prep, send_kwargs) │ │ │ └ {'timeout': None, 'allow_redirects': True, 'proxies': OrderedDict(), 'stream': True, 'verify': True, 'cert': None} │ │ └ <PreparedRequest [GET]> │ └ <function Session.send at 0x7fc3ba82c280> └ <requests.sessions.Session object at 0x7fbef0bc54e0> File "/opt/mineru_venv/lib/python3.10/site-packages/requests/sessions.py", line 703, in send r = adapter.send(request, kwargs) │ │ 
│ └ {'timeout': None, 'proxies': OrderedDict(), 'stream': True, 'verify': True, 'cert': None} │ │ └ <PreparedRequest [GET]> │ └ <function HTTPAdapter.send at 0x7fc3ba80f6d0> └ <requests.adapters.HTTPAdapter object at 0x7fbef0bc6410> File "/opt/mineru_venv/lib/python3.10/site-packages/requests/adapters.py", line 700, in send raise ConnectionError(e, request=request) │ └ <PreparedRequest [GET]> └ <class 'requests.exceptions.ConnectionError'>

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='paddleocr.bj.bcebos.com', port=443): Max retries exceeded with url: /PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fbef0c66ec0>: Failed to resolve 'paddleocr.bj.bcebos.com' ([Errno -3] Temporary failure in name resolution)"))
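
The chain of exceptions above starts at `socket.gaierror` raised by `socket.getaddrinfo` for `paddleocr.bj.bcebos.com`, so the failure is DNS resolution rather than anything GPU-related. A minimal sketch of a pre-flight check is shown below; the helper name `can_resolve` and its `resolver` parameter are illustrative assumptions, not part of magic-pdf or PaddleOCR.

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if `host` resolves, False on a name-resolution failure.

    `resolver` is a hypothetical hook so the check can be exercised without
    network access; by default it is the same call that fails in the traceback.
    """
    try:
        resolver(host, port)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    host = "paddleocr.bj.bcebos.com"
    if not can_resolve(host):
        print(f"{host} does not resolve; model files must be provided offline")
```

If the check fails, PaddleOCR cannot download its models at startup, and they have to be placed on disk beforehand.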

James-Dao commented 1 month ago

My machine cannot reach the external network.

myhloli commented 1 month ago

Download the directory at https://huggingface.co/spaces/opendatalab/MinerU/tree/main/paddleocr and copy the `whl` folder inside it into `/root/.paddleocr/` in the Docker container.
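
The copy step described above can be sketched in Python. This is a minimal sketch, assuming the downloaded `paddleocr` directory has already been transferred to the offline machine. The function name `install_paddleocr_models` and the source location are hypothetical. Only the `whl` folder name and the `/root/.paddleocr/` target come from this thread.

```python
import shutil
from pathlib import Path


def install_paddleocr_models(downloaded_dir, paddleocr_home="/root/.paddleocr"):
    """Copy the `whl` folder from a downloaded paddleocr directory into paddleocr_home.

    `downloaded_dir` is wherever the directory fetched from the Hugging Face
    space was placed (a hypothetical location). Returns the destination path.
    """
    src = Path(downloaded_dir) / "whl"
    if not src.is_dir():
        raise FileNotFoundError(f"expected a 'whl' folder at {src}")
    dst = Path(paddleocr_home) / "whl"
    dst.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok=True merges into an existing whl folder instead of failing.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

The traceback earlier in the thread shows PaddleOCR looking for the detection model under `/root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer`, so that path should exist after the copy if the downloaded folder mirrors that layout.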

James-Dao commented 1 month ago
magic-pdf -p /root/data/small_ocr.pdf -o /root/data/ 2024-09-26 04:02:42.230 INFO magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 8, cid_chars_radio: 0.0 2024-09-26 04:02:42.235 WARNING magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: False, by_invalid_chars: True 2024-09-26 04:02:56.288 INFO magic_pdf.model.pdf_extract_kit:init:180 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: True, apply_table: True 2024-09-26 04:02:56.288 INFO magic_pdf.model.pdf_extract_kit:init:188 - using device: cuda 2024-09-26 04:02:56.288 INFO magic_pdf.model.pdf_extract_kit:init:190 - using models_dir: /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models CustomVisionEncoderDecoderModel init CustomMBartForCausalLM init CustomMBartDecoder init [09/26 04:03:23 detectron2]: Rank of current process: 0. World size: 1 [09/26 04:03:24 detectron2]: Environment info:

sys.platform linux Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] numpy 1.26.4 detectron2 0.6 @/opt/mineru_venv/lib/python3.10/site-packages/detectron2 Compiler GCC 11.4 CUDA compiler not available DETECTRON2_ENV_MODULE PyTorch 2.3.1+cu121 @/opt/mineru_venv/lib/python3.10/site-packages/torch PyTorch debug build False torch._C._GLIBCXX_USE_CXX11_ABI False GPU available Yes GPU 0 NVIDIA H100 80GB HBM3 (arch=9.0) Driver version 535.104.12 CUDA_HOME None - invalid! Pillow 10.4.0 torchvision 0.18.1+cu121 @/opt/mineru_venv/lib/python3.10/site-packages/torchvision torchvision arch flags /opt/mineru_venv/lib/python3.10/site-packages/torchvision/_C.so fvcore 0.1.5.post20221221 iopath 0.1.9 cv2 4.6.0


PyTorch built with:

[09/26 04:03:24 detectron2]: Command line arguments: {'config_file': '/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth']} [09/26 04:03:24 detectron2]: Contents of args.config_file=/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml: AUG: DETR: true CACHE_DIR: ~/cache/huggingface CUDNN_BENCHMARK: false DATALOADER: ASPECT_RATIO_GROUPING: true FILTER_EMPTY_ANNOTATIONS: false NUM_WORKERS: 4 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: [] PROPOSAL_FILES_TRAIN: [] TEST:

[09/26 04:03:26 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ... [09/26 04:03:26 fvcore.common.checkpoint]: [Checkpointer] Loading from /root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/Layout/model_final.pth ... 2024-09-26 04:03:36.702 | INFO | magic_pdf.model.pdf_extract_kit:init:248 - DocAnalysis init done! 2024-09-26 04:03:36.704 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 54.46890354156494 2024-09-26 04:03:40.151 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 1.95

0: 1888x1312 (no detections), 153.6ms Speed: 40.8ms preprocess, 153.6ms inference, 1.1ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:41.320 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0 2024-09-26 04:03:42.920 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 1.55 2024-09-26 04:03:42.920 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:43.730 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 0.81

0: 1888x1312 4 embeddings, 12.7ms Speed: 18.1ms preprocess, 12.7ms inference, 5.0ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:45.443 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 4, mfr time: 1.48 2024-09-26 04:03:45.823 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 0.33 2024-09-26 04:03:45.824 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:46.483 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 0.66

0: 1888x1312 (no detections), 16.8ms Speed: 44.8ms preprocess, 16.8ms inference, 1.2ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:46.551 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0 2024-09-26 04:03:46.897 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 0.3 2024-09-26 04:03:46.897 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:47.528 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 0.63

0: 1888x1312 (no detections), 12.6ms Speed: 17.9ms preprocess, 12.6ms inference, 0.6ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:47.561 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0 2024-09-26 04:03:47.793 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 0.21 2024-09-26 04:03:47.794 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:48.448 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 0.65

0: 1888x1312 (no detections), 12.5ms Speed: 17.7ms preprocess, 12.5ms inference, 0.6ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:48.481 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0 2024-09-26 04:03:48.730 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 0.23 2024-09-26 04:03:48.730 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:49.518 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 0.79

0: 1888x1312 (no detections), 12.5ms Speed: 19.7ms preprocess, 12.5ms inference, 0.6ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:49.553 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0 2024-09-26 04:03:49.808 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 0.23 2024-09-26 04:03:49.809 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:50.576 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 0.77

0: 1888x1312 3 embeddings, 12.5ms Speed: 21.9ms preprocess, 12.5ms inference, 1.6ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:51.198 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 3, mfr time: 0.52 2024-09-26 04:03:51.430 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 0.2 2024-09-26 04:03:51.430 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:52.212 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 0.78

0: 1888x1312 (no detections), 13.2ms Speed: 29.9ms preprocess, 13.2ms inference, 0.8ms postprocess per image at shape (1, 3, 1888, 1312) 2024-09-26 04:03:52.259 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0 2024-09-26 04:03:52.582 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 0.23 2024-09-26 04:03:52.582 | INFO | magic_pdf.model.pdf_extract_kit:call:407 - table cost: 0.0 2024-09-26 04:03:52.582 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:136 - doc analyze cost: 14.38168454170227 2024-09-26 04:03:52.873 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 0, last_page_cost_time: 0.0 2024-09-26 04:03:52.878 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 1, last_page_cost_time: 0.01 2024-09-26 04:03:52.879 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 2, last_page_cost_time: 0.0 2024-09-26 04:03:52.879 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 3, last_page_cost_time: 0.0 2024-09-26 04:03:52.880 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 4, last_page_cost_time: 0.0 2024-09-26 04:03:52.880 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 5, last_page_cost_time: 0.0 2024-09-26 04:03:52.887 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 6, last_page_cost_time: 0.01 2024-09-26 04:03:52.888 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 7, last_page_cost_time: 0.0 2024-09-26 04:03:52.901 | INFO | magic_pdf.para.para_split_v2:detect_list_lines:145 - 发现了列表,列表行数:[(1, 2)], [[1]] 2024-09-26 04:03:52.901 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:158 - 列表行的第1到第2行是列表 2024-09-26 04:03:52.903 | INFO | magic_pdf.para.para_split_v2:connect_list_inter_page:471 - 连接page 2 内的list 2024-09-26 04:03:53.056 | INFO | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:53 - uni_pipe mk mm_markdown finished 2024-09-26 
04:03:53.074 | INFO | magic_pdf.pipe.UNIPipe:pipe_mk_uni_format:48 - uni_pipe mk content list finished 2024-09-26 04:03:53.075 | INFO | magic_pdf.tools.common:do_parse:139 - local output dir is /root/data/small_ocr/auto

James-Dao commented 1 month ago

Your test file runs fine, but it doesn't seem to contain any tables.

myhloli commented 1 month ago

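It runs normally, so the whole environment should be OK. The earlier table error looks more like an isolated case tied to a particular sample. Could you share that file with us so we can check whether we can fix the problem?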

myhloli commented 1 month ago

Your test file runs fine, but it doesn't seem to contain any tables.

As long as paddle runs normally, that is enough. You can find some other PDFs that contain tables and test them yourself.
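
Testing several table-bearing PDFs can be scripted. The sketch below reuses the `magic-pdf -p <pdf> -o <output_dir>` invocation shown in this thread. The helper names are hypothetical, and the error check assumes failures appear as `| ERROR |` lines in the log format seen above. It is an assumption, not confirmed here, that the exit code alone may not reflect such failures.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_dir, output_dir):
    """Return one `magic-pdf -p <pdf> -o <output_dir>` command per PDF in pdf_dir."""
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    return [["magic-pdf", "-p", str(pdf), "-o", str(output_dir)] for pdf in pdfs]


def log_has_error(log_text):
    """Return True if any log line carries the `| ERROR |` level marker."""
    return any("| ERROR |" in line for line in log_text.splitlines())


def run_all(pdf_dir, output_dir):
    """Run magic-pdf on every PDF and return the paths whose run looked failed."""
    failed = []
    for cmd in build_commands(pdf_dir, output_dir):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0 or log_has_error(result.stdout + result.stderr):
            failed.append(cmd[2])  # cmd[2] is the PDF path
    return failed
```

Any path returned by `run_all` is a candidate sample to attach to the issue.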

James-Dao commented 1 month ago

I'll dig up my test data and send you a few pages to try.

James-Dao commented 1 month ago

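One more question: my PDF is in Cantonese. Running it on the CPU is a bit slower, and in the output some characters were recognized incorrectly. OCR was not enabled. Is this easy to fix?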

James-Dao commented 1 month ago

[Uploading haiguan-12-19.pdf…]()

James-Dao commented 1 month ago

This is the file I tested with.