opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
10.91k stars, 805 forks

Ubuntu 22.04 dependency issues #242

Closed (newplay closed this issue 1 month ago)

newplay commented 1 month ago

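Description of the bug

Installing on Ubuntu 22.04 LTS runs into numerous dependency problems.

How to reproduce the bug

Installing from source as described in readme.md easily triggers a variety of dependency problems, such as a core crash in paddle and a ValueError: "SP_DIR" from torchtext. The workaround I have tested so far: create a virtual environment and activate it.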

conda create -n MinerU python=3.10
conda activate MinerU

Go to the MinerU root directory:

cd /xxx/xxx/MinerU/

Run setup.py:

python setup.py install
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/ 

At this point you may hit ImportError: libmupdf.so.24.8: cannot open shared object file: No such file or directory. Locate libmupdf.so.24.8 and add its directory to $LD_LIBRARY_PATH.

find /xxx/anaconda3/envs/MinerU/ -name "libmupdf.so*"
export LD_LIBRARY_PATH=/xxx/anaconda3/envs/MinerU/lib/python3.10/site-packages/PyMuPDFb-1.24.9-py3.10-linux-x86_64.egg/pymupdf/:$LD_LIBRARY_PATH
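
The find and export commands above can be folded into one step so that the egg path does not have to be copied by hand. This is only a minimal sketch, not part of MinerU: the function name add_mupdf_to_ld_path is hypothetical, and it assumes the first libmupdf.so* match under the search root is the one you want.

```shell
# Hypothetical helper (not part of MinerU): prepend the directory that
# contains libmupdf.so* to LD_LIBRARY_PATH.
# $1 is the directory to search, e.g. the root of the conda environment.
add_mupdf_to_ld_path() {
    search_root="$1"
    # Take the first match; assumption: one PyMuPDF copy per environment.
    lib_file=$(find "$search_root" -name "libmupdf.so*" 2>/dev/null | head -n 1)
    if [ -z "$lib_file" ]; then
        echo "libmupdf.so* not found under $search_root" >&2
        return 1
    fi
    lib_dir=$(dirname "$lib_file")
    # Avoid a trailing colon when LD_LIBRARY_PATH was empty or unset.
    export LD_LIBRARY_PATH="$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

With the environment activated, a call such as add_mupdf_to_ld_path "$CONDA_PREFIX" would search the active environment, since conda sets CONDA_PREFIX on activation.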

Next, install the required dependencies with conda:

conda install pytorch=2.3.0 torchvision=0.18.0 torchtext=0.18.0 torchaudio pytorch-cuda=11.8 cudatoolkit=11.8 cudnn paddlepaddle-gpu=3.0.0b1 paddlepaddle-cuda=11.8 -c paddle -c pytorch -c nvidia
conda install ultralytics -c conda-forge  # required for YOLOv8

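Here I switched to pytorch=2.3.0; in my tests this version has fewer dependency constraints. With pytorch=2.3.1 you may find that the available torchtext version is too old to install, or that installing torchtext=0.18.0 with pip stops with an SP_DIR error.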
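
To confirm which versions actually ended up in the environment, the installed version strings can be compared against the pins above. A minimal sketch: version_matches is a hypothetical helper, and it assumes that a local build suffix such as +cu118 (which some PyTorch builds carry) should be ignored when comparing.

```shell
# Hypothetical helper: succeed when the installed version equals the pin.
# $1 is the expected version, $2 the version string reported by the package.
version_matches() {
    expected="$1"
    # Drop everything from the first "+" onward, e.g. 2.3.0+cu118 -> 2.3.0.
    actual="${2%%+*}"
    [ "$expected" = "$actual" ]
}
```

For example, version_matches 2.3.0 "$(python -c 'import torch; print(torch.__version__)')" would check the pytorch pin, and the same pattern with torchtext and 0.18.0 would check the torchtext pin.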

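pip install paddleocr==2.7.3  # must be installed with pip; the conda version is too old for Python 3.10
pip install unimernet  # reported as missing after running magic-pdf pdf-command --pdf; must be installed with pip
pip install ultralytics-thop  # likewise reported as missing after a run; conda cannot install it, so use pip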
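
Since the missing packages above only showed up after a run, a quick pre-flight check may save a round trip. This is a sketch under assumptions: missing_modules is a hypothetical helper, it assumes a python3 executable on PATH (override with the PYTHON variable), and it only checks that a module can be located, not that it imports cleanly.

```shell
# Hypothetical helper: print each module name from the arguments that the
# interpreter cannot locate. Uses python3 unless PYTHON is set.
missing_modules() {
    for module in "$@"; do
        # find_spec returns None for an unknown top-level module, so the
        # interpreter exits with status 1 and the name is printed.
        if ! "${PYTHON:-python3}" -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec('$module') else 1)" 2>/dev/null; then
            echo "$module"
        fi
    done
}
```

A call such as missing_modules paddleocr unimernet ultralytics thop would list whatever is still absent; the import names are assumptions here, in particular thop as the module provided by ultralytics-thop.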

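Downloading the models and configuring magic-pdf.json are unchanged.

This is how I got it running on Ubuntu 22.04 LTS; I am sharing it for everyone's reference.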

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

myhloli commented 1 month ago

Thanks for your feedback. Conda is recommended for environment creation only. Try to install all dependencies using pip, which can significantly reduce version conflict issues. Ubuntu 22.04 LTS is our mainline compatibility version. In subsequent compatibility releases, a single command will be supported to install the full package without encountering compatibility issues.

newplay commented 1 month ago

I agree with what you are saying, but I just wanted to share my solution for handling dependency issues. I encountered many problems and errors when using pip install magic-pdf[full-cpu]. Before the new version is released, I hope to provide a simpler workaround so that more people can try out this tool. I'm really looking forward to the new version and excited to see the improvements it will bring. Best Regards, TzuChing

myhloli commented 1 month ago

@newplay Thank you once more for your sincere contribution. Hope you have an awesome day!

myhloli commented 1 month ago

We have published the 0.6.2b1 release, which solves this problem.