opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
13.43k stars 1.01k forks

Models downloaded and config set up, but parsing a PDF fails with an error #735

Closed fuxuelinwudi closed 3 days ago

fuxuelinwudi commented 2 weeks ago

Description of the bug

[10/14 10:33:42 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /data/deploy/pre_release/fxl/gomate/all_server/comp_server/mineru_root/root_models/PDF-Extract-Kit/models/Layout/model_final.pth ...
[10/14 10:33:42 fvcore.common.checkpoint]: [Checkpointer] Loading from /data/deploy/pre_release/fxl/gomate/all_server/comp_server/mineru_root/root_models/PDF-Extract-Kit/models/Layout/model_final.pth ...
2024-10-14 10:33:45.280 | INFO | magic_pdf.model.pdf_extract_kit:init:248 - DocAnalysis init done!
2024-10-14 10:33:45.280 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 25.69013738632202
2024-10-14 10:33:50.120 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 3.53

0: 1888x1312 (no detections), 78.5ms
Speed: 16.2ms preprocess, 78.5ms inference, 0.6ms postprocess per image at shape (1, 3, 1888, 1312)
2024-10-14 10:33:50.704 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0
2024-10-14 10:33:50.905 | ERROR | magic_pdf.tools.cli:parse_doc:96 - (External) CUBLAS error(7). [Hint: 'CUBLAS_STATUS_INVALID_VALUE'. An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. ] (at ../paddle/phi/kernels/funcs/blas/blas_impl.cu.h:40) [operator < fc > error]

How to reproduce the bug

My config:

{
  "bucket_info": {
    "bucket-name-1": ["ak", "sk", "endpoint"],
    "bucket-name-2": ["ak", "sk", "endpoint"]
  },
  "models-dir": "/data/deploy/pre_release/fxl/gomate/all_server/comp_server/mineru_root/root_models/PDF-Extract-Kit/models",
  "layoutreader-model-dir": "/data/deploy/pre_release/fxl/gomate/all_server/comp_server/mineru_root/root_models/layoutreader",
  "device-mode": "cuda",
  "table-config": {
    "model": "TableMaster",
    "is_table_recog_enable": false,
    "max_time": 400
  }
}
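Before chasing a CUDA error, it can help to rule out a plain path mistake in the config. The following is a minimal sketch, not part of MinerU: `check_config` is a hypothetical helper, and treating the three keys below as required is an assumption based only on the config shown above.

```python
import json
import os

# Assumption: these three keys from the config above are the ones worth checking.
CHECKED_KEYS = ("models-dir", "layoutreader-model-dir", "device-mode")


def check_config(config_path):
    """Return a list of human-readable problems found in the config file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    problems = []
    for key in CHECKED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")

    # The two model directories must exist on disk for the models to load.
    for key in ("models-dir", "layoutreader-model-dir"):
        path = config.get(key)
        if path is not None and not os.path.isdir(path):
            problems.append(f"{key} is not a directory: {path}")

    return problems
```

Calling `check_config` with the path to the JSON file returns an empty list when nothing looks wrong. An empty list only means the paths exist; it says nothing about whether the installed packages are compatible.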

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cuda

fuxuelinwudi commented 2 weeks ago

My CUDA version is 12.1, and I ran:

python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

Is that the problem? But when I run the cu121 variant instead, Paddle fails to install:

python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu121/

myhloli commented 2 weeks ago

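Judging from the error, Paddle and CUDA are incompatible. On Linux, Paddle ships with its own CUDA runtime, and you do need to install the cu118 build to avoid a conflict with torch's CUDA 12.1. Neither of them uses the CUDA installed on your system.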

fuxuelinwudi commented 2 weeks ago

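What should I do?

> Judging from the error, Paddle and CUDA are incompatible. On Linux, Paddle ships with its own CUDA runtime, and you do need to install the cu118 build to avoid a conflict with torch's CUDA 12.1. Neither of them uses the CUDA installed on your system.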

myhloli commented 2 weeks ago

You can uninstall paddlepaddle-gpu and paddlepaddle, then reinstall paddlepaddle and run with the CPU version of Paddle.
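The CPU fallback described above amounts to two pip invocations. As a rough sketch, they can be built for the interpreter of the active environment so that they target the right virtual environment. The helper name `cpu_fallback_commands` is hypothetical, and leaving `paddlepaddle` unpinned is an assumption, since the thread does not name a CPU version.

```python
import sys


def cpu_fallback_commands(python=sys.executable):
    """Build the pip commands that swap GPU Paddle for the CPU build.

    The commands are only returned, not executed, so they can be
    reviewed before anything in the environment is changed.
    """
    pip = [python, "-m", "pip"]
    return [
        # Remove both variants so no GPU build is left behind.
        pip + ["uninstall", "-y", "paddlepaddle-gpu", "paddlepaddle"],
        # Reinstall the CPU-only package (version left unpinned here).
        pip + ["install", "paddlepaddle"],
    ]
```

To apply them, pass each command to `subprocess.run(cmd, check=True)` in order. Afterwards, `paddle.utils.run_check()` can be used to confirm that Paddle itself runs.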

fuxuelinwudi commented 2 weeks ago

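> You can uninstall paddlepaddle-gpu and paddlepaddle, then reinstall paddlepaddle and run with the CPU version of Paddle.

That worked, thanks. If I want to use the CUDA version, does the Linux system's CUDA need to be 11.8?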

myhloli commented 2 weeks ago

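There is no need to change the CUDA installed on the Linux system. On Linux, the CUDA runtimes for both torch and Paddle are installed as pip dependencies inside the conda virtual environment; the system itself only needs the driver. If you followed the tutorial and the result is still incompatible, it is usually hard to debug, so I recommend falling back to CPU or switching to a different deployment environment.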
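Since the relevant CUDA runtimes come from pip packages rather than from the system, the first thing worth inspecting is what is actually installed in the virtual environment. A minimal sketch using only the standard library follows; the package list is an assumption drawn from the names mentioned in this thread.

```python
from importlib import metadata

# Assumption: these are the distributions relevant to this issue.
PACKAGES = ("torch", "paddlepaddle", "paddlepaddle-gpu", "numpy")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in the current environment.
            versions[name] = None
    return versions
```

Running `installed_versions()` inside the conda environment shows, for example, whether both `paddlepaddle` and `paddlepaddle-gpu` are present at once. It reads package metadata only, so it works even when importing Paddle itself fails.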

fuxuelinwudi commented 2 weeks ago

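> There is no need to change the CUDA installed on the Linux system. On Linux, the CUDA runtimes for both torch and Paddle are installed as pip dependencies inside the conda virtual environment; the system itself only needs the driver. If you followed the tutorial and the result is still incompatible, it is usually hard to debug, so I recommend falling back to CPU or switching to a different deployment environment.

OK.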

fuxuelinwudi commented 2 weeks ago

A question about this shell command: magic-pdf -p small_ocr.pdf

Which .py file does it run? I'd like to write my own Python script.

fuxuelinwudi commented 2 weeks ago

I installed CUDA 11.8 and then installed paddle-gpu, and now I get another error:

2024-10-14 11:36:23.335 | ERROR | magic_pdf.tools.cli:parse_doc:96 - Unable to avoid copy while creating an array as requested. If using np.array(obj, copy=False) replace it with np.asarray(obj) to allow a copy when needed (no behavior change in NumPy 1.x). For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword

myhloli commented 2 weeks ago

> A question about this shell command: magic-pdf -p small_ocr.pdf
>
> Which .py file does it run? I'd like to write my own Python script.

https://github.com/opendatalab/MinerU/blob/master/magic_pdf/tools/cli.py
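For a custom script, one low-risk starting point is to wrap the same command line from Python rather than import internals from cli.py. This is only a sketch: `build_command` and `parse_pdf` are hypothetical helper names, and the only flag used is the `-p` shown in the question above.

```python
import subprocess


def build_command(pdf_path):
    """Build the same command line as `magic-pdf -p <pdf>`."""
    return ["magic-pdf", "-p", str(pdf_path)]


def parse_pdf(pdf_path):
    """Run the CLI on one PDF and return its exit code.

    Assumes the `magic-pdf` executable is on PATH, which is the case
    inside the environment where the package was installed.
    """
    return subprocess.run(build_command(pdf_path)).returncode
```

For example, `parse_pdf("small_ocr.pdf")` runs the same command as the shell line. Calling the library functions directly gives finer control, but requires reading cli.py to see which ones the installed version uses.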

myhloli commented 2 weeks ago

> I installed CUDA 11.8 and then installed paddle-gpu, and now I get another error:
>
> 2024-10-14 11:36:23.335 | ERROR | magic_pdf.tools.cli:parse_doc:96 - Unable to avoid copy while creating an array as requested. If using np.array(obj, copy=False) replace it with np.asarray(obj) to allow a copy when needed (no behavior change in NumPy 1.x). For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword

NumPy 2.0 and above is not supported; you need to downgrade to 1.x.
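Whether the environment is affected can be checked from the version string alone. A small sketch with hypothetical helper names:

```python
def numpy_major(version):
    """Extract the major version number from a string like '1.26.4'."""
    return int(version.split(".")[0])


def needs_downgrade(version):
    """True when the given NumPy version is 2.0 or newer."""
    return numpy_major(version) >= 2
```

In practice this would be called as `needs_downgrade(numpy.__version__)` after `import numpy`. If it returns True, installing with the requirement specifier `numpy<2` keeps pip on the 1.x series.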

fuxuelinwudi commented 2 weeks ago

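Looking at the generated Markdown output, it seems multi-level headings cannot be recognized? Everything is recognized as a level-1 heading.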

myhloli commented 2 weeks ago

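> Looking at the generated Markdown output, it seems multi-level headings cannot be recognized? Everything is recognized as a level-1 heading.

There is currently no multi-level heading recognition.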
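Until such a feature exists, a post-processing heuristic may recover some structure when the document numbers its sections. The sketch below is not part of MinerU and rests on a strong assumption: that headings start with a section number such as "2.1", whose depth gives the heading level. Unnumbered headings are left unchanged.

```python
import re

# A level-1 heading whose text starts with a section number such as
# "2", "2.1" or "2.1.3", optionally followed by a trailing dot.
NUMBERED_H1 = re.compile(r"^# (\d+(?:\.\d+)*)\.?\s+\S")


def infer_heading_levels(markdown):
    """Demote numbered level-1 headings according to their numbering depth."""
    out = []
    for line in markdown.splitlines():
        match = NUMBERED_H1.match(line)
        if match:
            # "2.1.3" has two dots, so it becomes a level-3 heading.
            depth = match.group(1).count(".") + 1
            # Markdown supports at most six heading levels.
            line = "#" * min(depth, 6) + line[1:]
        out.append(line)
    return "\n".join(out)
```

With this rule, a line such as `# 1.1 Background` becomes `## 1.1 Background`, while `# Appendix` stays at level 1. Documents without numbered sections gain nothing from it.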