opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
17.69k stars 1.28k forks

Dockerfile at commit d0558ab fails to build because the yaml library is missing #955

Closed cyicz123 closed 5 days ago

cyicz123 commented 5 days ago

Description of the bug

Building with Docker as described in the documentation fails because the yaml library is missing. Modifying the Dockerfile to install PyYAML resolves the problem.

# Install PyYAML alongside modelscope
RUN /bin/bash -c "pip3 install modelscope PyYAML && \
    wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py && \
    python3 download_models.py && \
    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

How to reproduce the bug

(base) ➜  MinerU docker build -t mineru:latest .
[+] Building 398.4s (10/10) FINISHED                                                                                                                                 docker:default
 => [internal] load build definition from Dockerfile                                                                                                                           2.3s
 => => transferring dockerfile: 2.11kB                                                                                                                                         0.0s
 => [internal] load metadata for docker.io/library/ubuntu:22.04                                                                                                               32.5s
 => [internal] load .dockerignore                                                                                                                                              0.3s
 => => transferring context: 2B                                                                                                                                                0.0s
 => [1/7] FROM docker.io/library/ubuntu:22.04@sha256:0e5e4a57c2499249aafc3b40fcd541e9a456aab7296681a3994d631587203f97                                                          0.0s
 => CACHED [2/7] RUN apt-get update &&     apt-get install -y         software-properties-common &&     add-apt-repository ppa:deadsnakes/ppa &&     apt-get update &&     ap  0.0s
 => CACHED [3/7] RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1                                                                              0.0s
 => CACHED [4/7] RUN python3 -m venv /opt/mineru_venv                                                                                                                          0.0s
 => CACHED [5/7] RUN /bin/bash -c "source /opt/mineru_venv/bin/activate &&     pip3 install --upgrade pip &&     wget https://gitee.com/myhloli/MinerU/raw/master/requirement  0.0s
 => CACHED [6/7] RUN /bin/bash -c "wget https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json &&     cp magic-pdf.template.json /root/magic-pdf.json &&     sou  0.0s
 => ERROR [7/7] RUN /bin/bash -c "pip3 install modelscope &&     wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py &&     python3 download_models  361.5s
------
 > [7/7] RUN /bin/bash -c "pip3 install modelscope &&     wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py &&     python3 download_models.py &&     sed -i 's|cpu|cuda|g' /root/magic-pdf.json":
33.05 WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)")': /simple/modelscope/
35.60 Collecting modelscope
37.59   Downloading modelscope-1.20.0-py3-none-any.whl (5.8 MB)
311.2      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.8/5.8 MB 15.4 kB/s eta 0:00:00
313.0 Collecting urllib3>=1.26
313.1   Downloading urllib3-2.2.3-py3-none-any.whl (126 kB)
320.4      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.3/126.3 KB 19.3 kB/s eta 0:00:00
321.9 Collecting tqdm>=4.64.0
322.0   Downloading tqdm-4.67.0-py3-none-any.whl (78 kB)
326.3      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.6/78.6 KB 17.7 kB/s eta 0:00:00
327.4 Collecting requests>=2.25
327.5   Downloading requests-2.32.3-py3-none-any.whl (64 kB)
330.5      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 KB 24.8 kB/s eta 0:00:00
331.3 Collecting certifi>=2017.4.17
331.4   Downloading certifi-2024.8.30-py3-none-any.whl (167 kB)
339.0      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 167.3/167.3 KB 22.4 kB/s eta 0:00:00
339.3 Collecting idna<4,>=2.5
339.4   Downloading idna-3.10-py3-none-any.whl (70 kB)
344.0      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 KB 13.9 kB/s eta 0:00:00
347.3 Collecting charset-normalizer<4,>=2
347.4   Downloading charset_normalizer-3.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (144 kB)
353.2      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 144.8/144.8 KB 25.3 kB/s eta 0:00:00
353.6 Installing collected packages: urllib3, tqdm, idna, charset-normalizer, certifi, requests, modelscope
359.3 Successfully installed certifi-2024.8.30 charset-normalizer-3.4.0 idna-3.10 modelscope-1.20.0 requests-2.32.3 tqdm-4.67.0 urllib3-2.2.3
359.3 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
359.4 --2024-11-14 04:53:21--  https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py
359.4 Resolving gitee.com (gitee.com)... 180.76.198.225, 180.76.198.77
359.4 Connecting to gitee.com (gitee.com)|180.76.198.225|:443... connected.
359.5 HTTP request sent, awaiting response... 200 OK
359.7 Length: 1921 (1.9K) [text/plain]
359.7 Saving to: 'download_models.py'
359.7
359.7      0K .                                                     100%  164M=0s
359.7
359.7 2024-11-14 04:53:21 (164 MB/s) - 'download_models.py' saved [1921/1921]
359.7
359.9 Traceback (most recent call last):
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/import_utils.py", line 451, in _get_module
359.9     return importlib.import_module('.' + module_name, self.__name__)
359.9   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
359.9     return _bootstrap._gcd_import(name[level:], package, level)
359.9   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
359.9   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
359.9   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
359.9   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
359.9   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
359.9   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/fileio/io.py", line 8, in <module>
359.9     from .format import JsonHandler, YamlHandler
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/fileio/format/__init__.py", line 5, in <module>
359.9     from .yaml import YamlHandler
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/fileio/format/yaml.py", line 2, in <module>
359.9     import yaml
359.9 ModuleNotFoundError: No module named 'yaml'
359.9
359.9 The above exception was the direct cause of the following exception:
359.9
359.9 Traceback (most recent call last):
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/import_utils.py", line 451, in _get_module
359.9     return importlib.import_module('.' + module_name, self.__name__)
359.9   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
359.9     return _bootstrap._gcd_import(name[level:], package, level)
359.9   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
359.9   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
359.9   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
359.9   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
359.9   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
359.9   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/hub/snapshot_download.py", line 11, in <module>
359.9     from modelscope.hub.api import HubApi, ModelScopeConfig
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/hub/api.py", line 26, in <module>
359.9     from modelscope.fileio import io
359.9   File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/import_utils.py", line 432, in __getattr__
359.9     value = self._get_module(name)
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/import_utils.py", line 453, in _get_module
359.9     raise RuntimeError(
359.9 RuntimeError: Failed to import modelscope.fileio.io because of the following error (look up to see its traceback):
359.9 No module named 'yaml'
359.9
359.9 The above exception was the direct cause of the following exception:
359.9
359.9 Traceback (most recent call last):
359.9   File "//download_models.py", line 5, in <module>
359.9     from modelscope import snapshot_download
359.9   File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/import_utils.py", line 434, in __getattr__
359.9     module = self._get_module(self._class_to_module[name])
359.9   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/import_utils.py", line 453, in _get_module
359.9     raise RuntimeError(
359.9 RuntimeError: Failed to import modelscope.hub.snapshot_download because of the following error (look up to see its traceback):
359.9 Failed to import modelscope.fileio.io because of the following error (look up to see its traceback):
359.9 No module named 'yaml'
------
Dockerfile:44
--------------------
  43 |     # Download models and update the configuration file
  44 | >>> RUN /bin/bash -c "pip3 install modelscope && \
  45 | >>>     wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py && \
  46 | >>>     python3 download_models.py && \
  47 | >>>     sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
  48 |
--------------------
ERROR: failed to solve: process "/bin/sh -c /bin/bash -c \"pip3 install modelscope &&     wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py &&     python3 download_models.py &&     sed -i 's|cpu|cuda|g' /root/magic-pdf.json\"" did not complete successfully: exit code: 1
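
The chained traceback above repeats the same root cause three times, so it can help to pull the missing module name out of the build log automatically. The following is a minimal sketch, not part of MinerU or modelscope: the helper name `find_missing_modules` and the `sample_log` excerpt are illustrative, and the pattern only assumes the standard `ModuleNotFoundError` message format that Python prints.

```python
import re

# Matches the line Python prints for a missing import, e.g.
#   359.9 ModuleNotFoundError: No module named 'yaml'
MISSING_MODULE = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def find_missing_modules(log_text):
    """Return the distinct module names reported missing, in order of first appearance.

    Hypothetical helper for illustration; it only scans text and never imports anything.
    """
    seen = []
    for match in MISSING_MODULE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# A short excerpt of the build log from this report.
sample_log = """\
359.9     import yaml
359.9 ModuleNotFoundError: No module named 'yaml'
359.9 RuntimeError: Failed to import modelscope.fileio.io because of the following error (look up to see its traceback):
359.9 No module named 'yaml'
"""

print(find_missing_modules(sample_log))
```

Only the `ModuleNotFoundError` line is matched. The later `RuntimeError` lines restate the same message, so they are deliberately ignored.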

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

myhloli commented 5 days ago

Retesting confirmed the cause: modelscope 1.20.0 added an `import yaml` without updating its requirements.txt. As a temporary workaround, pin modelscope to 1.19.2 or install pyyaml yourself.
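
Whichever workaround is used, a quick pre-flight check before running download_models.py can confirm that the needed modules are importable. This is a minimal sketch under stated assumptions: the helper `missing_modules` and the `required` list are illustrative and not part of MinerU. It relies only on the standard library's `importlib.util.find_spec`, which locates a top-level module without importing it.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path.

    Hypothetical helper for illustration; intended for top-level module names only.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'yaml' is the import name provided by the PyYAML package.
required = ["modelscope", "yaml"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running a check like this in the same interpreter that runs download_models.py reports a missing `yaml` module as a single line instead of a nested traceback.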

myhloli commented 5 days ago

https://github.com/modelscope/modelscope/releases/tag/v1.20.1

fixed