opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
12.99k stars, 964 forks

Colab mineru_demo.ipynb failing; bad `/MFD/weights.pt` file reference #732

Open Analect opened 3 days ago

Analect commented 3 days ago

Description of the bug

I was trying to get this running on Colab. Running the step `!magic-pdf -p demo1.pdf -o output/ -m auto` failed with `[Errno 2] No such file or directory: '/root/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit/snapshots/a29caa466f6d07be0e4863bba64204009128931a/MFD/weights.pt'`. The referenced path seems to be missing the `models` subfolder that sits above `MFD/weights.pt`.
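One way to confirm this layout is to probe the snapshot directory for both candidate locations. This is a minimal sketch: `find_models_root` is a hypothetical helper, not part of magic-pdf, and it assumes only that the weights live at either `<snapshot>/MFD/weights.pt` or `<snapshot>/models/MFD/weights.pt`.

```python
from pathlib import Path
from typing import Optional


def find_models_root(snapshot_dir: str) -> Optional[Path]:
    """Return the directory that directly contains MFD/weights.pt, or None.

    Hypothetical helper for illustration; checks the snapshot root first,
    then the `models` subfolder described in this report.
    """
    snapshot = Path(snapshot_dir)
    for candidate in (snapshot, snapshot / "models"):
        if (candidate / "MFD" / "weights.pt").is_file():
            return candidate
    return None
```

In the notebook, passing the snapshot path from the error message and getting back a path ending in `models` would indicate that the configured model directory needs the extra `models` component.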

Also, the T4 on Colab reports the following CUDA version:

Sun Oct 13 12:15:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

For the step `!pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/`, what do you suggest? I notice there is no https://www.paddlepaddle.org.cn/packages/stable/cu122/.

Thanks.

How to reproduce the bug

Open the Colab demo at https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cuda

myhloli commented 3 days ago
  1. About the model files not being found: in the notebook, the command

!wget https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json && mv magic-pdf.template.json ~/magic-pdf.json && sed -i 's|/tmp/models|{model_dir}|g' ~/magic-pdf.json

should be changed to

!wget https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json && mv magic-pdf.template.json ~/magic-pdf.json && sed -i 's|/tmp/models|{model_dir}/models|g' ~/magic-pdf.json

so that the configured path includes the `models` subfolder.
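For anyone who prefers to patch the config from Python rather than `sed`, the same substitution can be sketched as below. This is only an illustration: `patch_models_dir` is a hypothetical helper, and the sample config text (including the `models-dir` key name) is an assumption about the template, not a copy of it. The one thing taken from the command is that the placeholder `/tmp/models` is replaced everywhere by `{model_dir}/models`.

```python
import json


def patch_models_dir(config_text: str, model_dir: str) -> str:
    """Replace every '/tmp/models' placeholder with '<model_dir>/models'.

    Mirrors the global sed substitution; operates on plain text so it does
    not depend on the template's key names.
    """
    return config_text.replace("/tmp/models", f"{model_dir}/models")


# Assumed sample content; the real template may have more or different keys.
sample = '{"models-dir": "/tmp/models", "device-mode": "cpu"}'
patched = patch_models_dir(sample, "/path/to/snapshot")

# The patched text should still be valid JSON.
config = json.loads(patched)
print(config["models-dir"])
```

In Colab, `{model_dir}` inside a `!` command is expanded from the Python variable of the same name, so the Python variant would take that variable directly as its second argument.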

To simplify model downloads and prevent similar issues, we have updated the model download script. The latest download_models_hf.py automatically downloads magic-pdf.json and configures the model path, so users no longer need to run the path-update command manually. For the detailed procedure, see: https://github.com/opendatalab/MinerU/blob/master/docs/README_Ubuntu_CUDA_Acceleration_en_US.md

  2. About the paddlepaddle-gpu version: cu118 is right. On Linux we use paddlepaddle-gpu built for cu118 and torch built for cu121 to avoid conflicts between the two.
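As background on why a cu118 wheel is usable on the T4 shown earlier: the "CUDA Version" that `nvidia-smi` prints is the newest CUDA runtime the installed driver supports, so a wheel built for an older runtime can generally run on it. The sketch below is a conservative check along those lines. The helper names are hypothetical, and it assumes wheel tags of the form `cu` + major version + one-digit minor version (for example `cu118` for 11.8).

```python
import re
from typing import Tuple


def parse_wheel_tag(tag: str) -> Tuple[int, int]:
    """Turn a tag such as 'cu118' into (11, 8). Assumes a one-digit minor."""
    digits = tag[2:]
    if not tag.startswith("cu") or len(digits) < 2 or not digits.isdigit():
        raise ValueError(f"unexpected wheel tag: {tag!r}")
    return int(digits[:-1]), int(digits[-1])


def driver_cuda_version(smi_text: str) -> Tuple[int, int]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def wheel_not_newer_than_driver(tag: str, smi_text: str) -> bool:
    """True when the wheel's CUDA runtime is not newer than the driver's."""
    return parse_wheel_tag(tag) <= driver_cuda_version(smi_text)


header = "| NVIDIA-SMI 535.104.05  Driver Version: 535.104.05  CUDA Version: 12.2 |"
print(wheel_not_newer_than_driver("cu118", header))
print(wheel_not_newer_than_driver("cu121", header))
```

A `False` result here does not prove a wheel will fail; it only means the simple "not newer than the driver" condition does not hold.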