opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports extraction from PDFs, web pages, and e-books in multiple formats.
https://mineru.readthedocs.io/
GNU Affero General Public License v3.0
13.75k stars 1.03k forks

Python 3.10: Unable to load weights from pytorch checkpoint file #330

Closed CeliaShu1024 closed 3 months ago

CeliaShu1024 commented 3 months ago

Description of the bug

As the title says. I re-downloaded the models from ModelScope twice, once via each of the two available routes, and verified both copies, but each one produces the following error. OSError: Unable to load weights from pytorch checkpoint file for '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin' at '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

How to reproduce the bug

After installing the required libraries and plugins, run a test from the command line:

magic-pdf pdf-command --pdf "path_to_sample/sample.pdf" --inside_model true

Full error report:

| ERROR    | magic_pdf.cli.magicpdf:parse_doc:338 - Unable to load weights from pytorch checkpoint file for '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin' at '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Traceback (most recent call last):

  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/transformers/modeling_utils.py", line 533, in load_state_dict
    return torch.load(
           │     └ <function torch_load at 0x7fe7d0aef250>
           └ <module 'torch' from '/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/torch/__init__.py'>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/ultralytics/utils/patches.py", line 86, in torch_load
    return _torch_load(*args, **kwargs)
           │            │       └ {'map_location': 'cpu', 'weights_only': True, 'mmap': True}
           │            └ ('/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin',)
           └ <function load at 0x7fe8a38b76d0>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/torch/serialization.py", line 1015, in load
    overall_storage = torch.UntypedStorage.from_file(os.fspath(f), False, size)
                      │     │              │         │  │      │          └ 3750208149
                      │     │              │         │  │      └ '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin'
                      │     │              │         │  └ <built-in function fspath>
                      │     │              │         └ <module 'os' from '/root/anaconda3/envs/MinerU/lib/python3.10/os.py'>
                      │     │              └ <staticmethod(<built-in method from_file of torch._C._StorageMeta object at 0x7fe882a7f380>)>
                      │     └ <class 'torch.storage.UntypedStorage'>
                      └ <module 'torch' from '/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/torch/__init__.py'>

RuntimeError: unable to mmap 3750208149 bytes from file </root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin>: Cannot allocate memory (12)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/transformers/modeling_utils.py", line 542, in load_state_dict
    if f.read(7) == "version":
       │ └ <method 'read' of '_io.TextIOWrapper' objects>
       └ <_io.TextIOWrapper name='/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin' mode='r' encoding='UTF-8'>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
                         │    │              │     │    │       └ False
                         │    │              │     │    └ 'strict'
                         │    │              │     └ <encodings.utf_8.IncrementalDecoder object at 0x7fe7a10ccf10>
                         │    │              └ b'PK\x03\x04\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18\x00\n\x00checkpoint_...
                         │    └ <built-in function utf_8_decode>
                         └ <encodings.utf_8.IncrementalDecoder object at 0x7fe7a10ccf10>

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/root/anaconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │   │    └ <Group cli>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x7fe8ada4b7f0>
           └ <Group cli>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x7fe8adecfc40>
         │    └ <function MultiCommand.invoke at 0x7fe8ada5c820>
         └ <Group cli>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           │               │       │       │      └ <click.core.Context object at 0x7fe895366020>
           │               │       │       └ <function Command.invoke at 0x7fe8ada5c310>
           │               │       └ <Command pdf-command>
           │               └ <click.core.Context object at 0x7fe895366020>
           └ <function MultiCommand.invoke.<locals>._process_result at 0x7fe8adf2bd90>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'pdf': '/root/data/pdfs/0ceb182a-f194-4d67-bbc4-6bb1546e6f6e.pdf', 'inside_model': True, 'model': None, 'method': 'auto', 'm...
           │   │      │    │           └ <click.core.Context object at 0x7fe895366020>
           │   │      │    └ <function pdf_command at 0x7fe894d72050>
           │   │      └ <Command pdf-command>
           │   └ <function Context.invoke at 0x7fe8ada4b010>
           └ <click.core.Context object at 0x7fe895366020>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'pdf': '/root/data/pdfs/0ceb182a-f194-4d67-bbc4-6bb1546e6f6e.pdf', 'inside_model': True, 'model': None, 'method': 'auto', 'm...
                       └ ()
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 352, in pdf_command
    parse_doc(pdf)
    │         └ '/root/data/pdfs/0ceb182a-f194-4d67-bbc4-6bb1546e6f6e.pdf'
    └ <function pdf_command.<locals>.parse_doc at 0x7fe894d71d80>
> File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 330, in parse_doc
    do_parse(
    └ <function do_parse at 0x7fe894d71990>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 111, in do_parse
    pipe.pipe_analyze()
    │    └ <function UNIPipe.pipe_analyze at 0x7fe894d705e0>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fe895365c30>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 31, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=True)
    │    │            │           │    └ b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n53 0 obj\r<</Linearized 1/L 2128191/O 55/E 531771/N 4/T 2127832/H [ 442 190]>>\rendobj\r    ...
    │    │            │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fe895365c30>
    │    │            └ <function doc_analyze at 0x7fe8a337f520>
    │    └ []
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fe895365c30>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 103, in doc_analyze
    custom_model = model_manager.get_model(ocr, show_log)
                   │             │         │    └ False
                   │             │         └ True
                   │             └ <function ModelSingleton.get_model at 0x7fe8a337f490>
                   └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x7fe895366d10>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 63, in get_model
    self._models[key] = custom_model_init(ocr=ocr, show_log=show_log)
    │    │       │      │                     │             └ False
    │    │       │      │                     └ True
    │    │       │      └ <function custom_model_init at 0x7fe8a337f370>
    │    │       └ (True, False)
    │    └ {}
    └ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x7fe895366d10>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 87, in custom_model_init
    custom_model = CustomPEKModel(ocr=ocr, show_log=show_log, models_dir=local_models_dir, device=device)
                   │                  │             │                    │                        └ 'cpu'
                   │                  │             │                    └ '/root/PDF-Extract-Kit/models'
                   │                  │             └ False
                   │                  └ True
                   └ <class 'magic_pdf.model.pdf_extract_kit.CustomPEKModel'>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 119, in __init__
    self.mfr_model, mfr_vis_processors = mfr_model_init(mfr_weight_dir, mfr_cfg_path, _device_=self.device)
    │                                    │              │               │                      │    └ 'cpu'
    │                                    │              │               │                      └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7fe895366d40>
    │                                    │              │               └ '/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/resources/model_config/UniMERNet/demo.yaml'
    │                                    │              └ '/root/PDF-Extract-Kit/models/MFR/UniMERNet'
    │                                    └ <function mfr_model_init at 0x7fe7a13f4dc0>
    └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7fe895366d40>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 47, in mfr_model_init
    model = task.build_model(cfg)
            │    │           └ <unimernet.common.config.Config object at 0x7fe7a0f43e50>
            │    └ <function BaseTask.build_model at 0x7fe7c1f72710>
            └ <unimernet.tasks.unimernet_train.UniMERNet_Train object at 0x7fe7a0f28df0>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/unimernet/tasks/base_task.py", line 33, in build_model
    return model_cls.from_config(model_config)
           │         │           └ {'arch': 'unimernet', 'load_finetuned': False, 'load_pretrained': True, 'pretrained': '/root/PDF-Extract-Kit/models/MFR/UniME...
           │         └ <classmethod(<function UniMERModel.from_config at 0x7fe7c1ef2200>)>
           └ <class 'unimernet.models.unimernet.unimernet.UniMERModel'>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/unimernet/models/unimernet/unimernet.py", line 102, in from_config
    model = cls(
            └ <class 'unimernet.models.unimernet.unimernet.UniMERModel'>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/unimernet/models/unimernet/unimernet.py", line 35, in __init__
    self.model = DonutEncoderDecoder(
    │            └ <class 'unimernet.models.unimernet.encoder_decoder.DonutEncoderDecoder'>
    └ UniMERModel()
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/unimernet/models/unimernet/encoder_decoder.py", line 714, in __init__
    self.model = CustomVisionEncoderDecoderModel.from_pretrained(model_name, config=self.config, length_aware=length_aware)
    │            │                               │               │                  │    │                    └ False
    │            │                               │               │                  │    └ VisionEncoderDecoderConfig {
    │            │                               │               │                  │        "_name_or_path": "unimernet/checkpoint-180000",
    │            │                               │               │                  │        "architectures": [
    │            │                               │               │                  │          "VisionEncoderDecoder...
    │            │                               │               │                  └ DonutEncoderDecoder()
    │            │                               │               └ '/root/PDF-Extract-Kit/models/MFR/UniMERNet'
    │            │                               └ <classmethod(<function VisionEncoderDecoderModel.from_pretrained at 0x7fe7c206f1c0>)>
    │            └ <class 'unimernet.models.unimernet.encoder_decoder.CustomVisionEncoderDecoderModel'>
    └ DonutEncoderDecoder()
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py", line 359, in from_pretrained
    return super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
                                   │                               │             └ {'config': VisionEncoderDecoderConfig {
                                   │                               │                 "_name_or_path": "unimernet/checkpoint-180000",
                                   │                               │                 "architectures": [
                                   │                               │                   "VisionEnc...
                                   │                               └ ()
                                   └ '/root/PDF-Extract-Kit/models/MFR/UniMERNet'
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3481, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
                 │               └ '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin'
                 └ <function load_state_dict at 0x7fe7c22129e0>
  File "/root/anaconda3/envs/MinerU/lib/python3.10/site-packages/transformers/modeling_utils.py", line 554, in load_state_dict
    raise OSError(

OSError: Unable to load weights from pytorch checkpoint file for '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin' at '/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

torch versions: torch 2.3.1, torchtext 0.18.0, torchvision 0.18.1
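
The second exception in the chain comes from the fallback shown in the traceback: after `torch.load` fails, `load_state_dict` reopens the file in text mode and checks whether it starts with `"version"`, which is how a Git LFS pointer file begins. The bytes in the traceback start with `PK\x03\x04`, the zip signature, so the file looks like a real checkpoint archive rather than a pointer stub. A minimal sketch of the same check, assuming a hypothetical helper named `classify_checkpoint` (not part of transformers or magic-pdf):

```python
# Hypothetical helper for illustration; not an API of transformers or magic-pdf.
ZIP_MAGIC = b"PK\x03\x04"                 # zip local-file header signature
LFS_PREFIX = b"version https://git-lfs"   # start of a Git LFS pointer file


def classify_checkpoint(path):
    """Guess what kind of file `path` is from its first few bytes."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_PREFIX))
    if head.startswith(ZIP_MAGIC):
        return "zip-archive"        # what torch.save writes by default
    if head.startswith(LFS_PREFIX):
        return "git-lfs-pointer"    # the weights were never actually downloaded
    return "unknown"
```

If this reports `zip-archive` for `pytorch_model.bin`, the `UnicodeDecodeError` is only a side effect of the fallback, and the first exception in the chain is the one to investigate.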

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cpu

myhloli commented 3 months ago

Check whether the sha256 of the model file matches the one listed on the web page.

CeliaShu1024 commented 3 months ago

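Check whether the sha256 of the model file matches the one listed on the web page.

I checked, and it matches the official site, which is what makes this so puzzling.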

(MinerU) root@iZf8z73npm2neua8j5km4fZ:~# shasum -a 256 /root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin
6c80486e05b8cfbb48324a8802a2909221d219dd46aa6a936b92f2225555935e  /root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin
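
The same check can be done without `shasum`. A minimal sketch, assuming a hypothetical helper named `sha256_of_file`; it reads the file in 1 MiB chunks so that a multi-gigabyte checkpoint never has to fit in memory at once:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared directly with the checksum published alongside the model file.
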
myhloli commented 3 months ago

Is your transformers version 4.40.0?

CeliaShu1024 commented 3 months ago

Is your transformers version 4.40.0?

Yes. I'm attaching the versions of all installed dependencies to help with troubleshooting.

(MinerU) root@iZf8z73npm2neua8j5km4fZ:~# pip list
Package                  Version
------------------------ ------------------
absl-py                  2.1.0
aiohappyeyeballs         2.3.4
aiohttp                  3.10.1
aiosignal                1.3.1
albucore                 0.0.13
albumentations           1.4.12
annotated-types          0.7.0
antlr4-python3-runtime   4.9.3
anyio                    4.4.0
astor                    0.8.1
async-timeout            4.0.3
attrdict                 2.0.1
attrs                    24.1.0
Babel                    2.15.0
bce-python-sdk           0.9.19
beautifulsoup4           4.12.3
black                    24.8.0
blinker                  1.8.2
boto3                    1.34.153
botocore                 1.34.153
braceexpand              0.1.7
Brotli                   1.1.0
cachetools               5.4.0
certifi                  2024.7.4
cffi                     1.16.0
charset-normalizer       3.3.2
click                    8.1.7
cloudpickle              3.0.0
colorlog                 6.8.2
contourpy                1.2.1
cryptography             43.0.0
cssselect                1.2.0
cssutils                 2.11.1
cycler                   0.12.1
Cython                   3.0.10
datasets                 2.20.0
decorator                5.1.1
detectron2               0.6
dill                     0.3.8
et-xmlfile               1.1.0
eva-decord               0.6.1
eval_type_backport       0.2.0
evaluate                 0.4.2
exceptiongroup           1.2.2
fairscale                0.4.13
fast-langdetect          0.2.0
fasttext-wheel           0.9.2
filelock                 3.15.4
fire                     0.6.0
Flask                    3.0.3
flask-babel              4.0.0
fonttools                4.53.1
frozenlist               1.4.1
fsspec                   2024.5.0
ftfy                     6.2.0
future                   1.0.0
fvcore                   0.1.5.post20221221
grpcio                   1.65.4
h11                      0.14.0
httpcore                 1.0.5
httpx                    0.27.0
huggingface-hub          0.24.5
hydra-core               1.3.2
idna                     3.7
imageio                  2.34.2
imgaug                   0.4.0
iopath                   0.1.9
itsdangerous             2.2.0
Jinja2                   3.1.4
jmespath                 1.0.1
joblib                   1.4.2
kiwisolver               1.4.5
langdetect               1.0.9
lazy_loader              0.4
lmdb                     1.5.1
loguru                   0.7.2
lxml                     5.2.2
magic-pdf                0.6.2b1
Markdown                 3.6
MarkupSafe               2.1.5
matplotlib               3.9.0
modelscope               1.17.0
more-itertools           10.3.0
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
mypy-extensions          1.0.0
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.6.20
nvidia-nvtx-cu12         12.1.105
omegaconf                2.3.0
opencv-contrib-python    4.6.0.66
opencv-python            4.6.0.66
opencv-python-headless   4.10.0.84
openpyxl                 3.1.5
opt-einsum               3.3.0
packaging                24.1
paddleocr                2.7.3
paddlepaddle             3.0.0b1
pandas                   2.2.2
pathspec                 0.12.1
pdf2docx                 0.5.8
pdfminer.six             20231228
pillow                   10.4.0
pip                      24.0
platformdirs             4.2.2
portalocker              2.10.1
premailer                3.10.0
protobuf                 4.25.4
psutil                   6.0.0
py-cpuinfo               9.0.0
pyarrow                  17.0.0
pyarrow-hotfix           0.6
pybind11                 2.13.1
pyclipper                1.3.0.post5
pycocotools              2.0.8
pycparser                2.22
pycryptodome             3.20.0
pydantic                 2.8.2
pydantic_core            2.20.1
PyMuPDF                  1.24.9
PyMuPDFb                 1.24.9
pyparsing                3.1.2
python-dateutil          2.9.0.post0
python-docx              1.1.2
pytz                     2024.1
PyYAML                   6.0.1
rapidfuzz                3.9.5
rarfile                  4.2
regex                    2024.7.24
requests                 2.32.3
robust-downloader        0.0.2
s3transfer               0.10.2
safetensors              0.4.3
scikit-image             0.24.0
scikit-learn             1.5.1
scipy                    1.14.0
seaborn                  0.13.2
setuptools               72.1.0
shapely                  2.0.5
six                      1.16.0
sniffio                  1.3.1
soupsieve                2.5
sympy                    1.13.1
tabulate                 0.9.0
tensorboard              2.17.0
tensorboard-data-server  0.7.2
termcolor                2.4.0
threadpoolctl            3.5.0
tifffile                 2024.7.24
timm                     0.9.16
tokenizers               0.19.1
tomli                    2.0.1
torch                    2.3.1
torchtext                0.18.0
torchvision              0.18.1
tqdm                     4.66.5
transformers             4.40.0
triton                   2.3.1
typing_extensions        4.12.2
tzdata                   2024.1
ultralytics              8.2.73
ultralytics-thop         2.0.0
unimernet                0.1.6
urllib3                  2.2.2
visualdl                 2.5.3
Wand                     0.6.13
wcwidth                  0.2.13
webdataset               0.2.86
Werkzeug                 3.0.3
wheel                    0.43.0
wordninja                2.0.0
xxhash                   3.4.1
yacs                     0.1.8
yarl                     1.9.4
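
Rather than scanning the whole list, the handful of versions relevant here can be read directly from the package metadata. A minimal sketch using the standard-library `importlib.metadata`; the helper names `installed_version` and `report` and the default package list are illustrative choices, not anything defined by magic-pdf:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages=("torch", "torchvision", "transformers", "unimernet", "magic-pdf")):
    """Map each package name to its installed version (None if not installed)."""
    return {name: installed_version(name) for name in packages}
```

Calling `report()` inside the same environment that runs `magic-pdf` returns a small dictionary that is easier to paste into an issue than the full `pip list` output.
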
myhloli commented 3 months ago

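The dependency versions look fine. It still feels like a model-file problem, although a matching sha256 makes that odd. If the model file is not corrupted, check its permissions and whether the program has read and write access to it.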
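
A minimal sketch of that permission check, assuming a hypothetical helper named `access_report`. `os.access` answers for the user running the interpreter, so it should be run as the same user that launches `magic-pdf`:

```python
import os


def access_report(path):
    """Report whether `path` exists and is readable/writable by the current user."""
    return {
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }
```

For example, `access_report("/root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin")` should show `readable: True` if permissions are not the cause.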

CeliaShu1024 commented 3 months ago

I went through the error output again, and the first exception looks like a memory allocation problem (see below):

RuntimeError: unable to mmap 3750208149 bytes from file </root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin>: Cannot allocate memory (12)

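I ran the model successfully on a Windows system with 32 GB of RAM. The test environment assigned to me here has only 2 GB of RAM, so I want to find out whether the model fails to load because it runs out of memory. Does magic-pdf support distributed computation or changing the chunk size?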
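
The arithmetic behind this suspicion can be sketched as follows. The traceback reports an attempt to mmap 3750208149 bytes (about 3.5 GiB), which is more than 2 GiB of RAM. The helper names below are hypothetical, and total RAM is only a rough proxy: whether an mmap actually succeeds also depends on swap and the kernel's overcommit settings.

```python
import os


def total_physical_memory():
    """Total RAM in bytes via POSIX sysconf; None where that is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, which has no os.sysconf


def checkpoint_exceeds_ram(checkpoint_bytes, ram_bytes):
    """True if the checkpoint alone is larger than the given amount of RAM."""
    return checkpoint_bytes > ram_bytes


MMAP_REQUEST = 3750208149  # size reported in the RuntimeError above
print(checkpoint_exceeds_ram(MMAP_REQUEST, 2 * 1024**3))   # 2 GiB test machine
print(checkpoint_exceeds_ram(MMAP_REQUEST, 32 * 1024**3))  # 32 GiB Windows machine
```

On the affected machine, `os.path.getsize(...)` on the checkpoint and `total_physical_memory()` can be passed in place of the hard-coded numbers.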

myhloli commented 3 months ago

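I went through the error output again, and the first exception looks like a memory allocation problem (see below):

RuntimeError: unable to mmap 3750208149 bytes from file </root/PDF-Extract-Kit/models/MFR/UniMERNet/pytorch_model.bin>: Cannot allocate memory (12)

I ran the model successfully on a Windows system with 32 GB of RAM. The test environment assigned to me here has only 2 GB of RAM, so I want to find out whether the model fails to load because it runs out of memory. Does magic-pdf support distributed computation or changing the chunk size?

Sorry, neither of those is supported. Running the program requires a machine with at least 16 GB of memory.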

CeliaShu1024 commented 3 months ago

OK. I'll first test with a GPU attached and on a local Linux machine to see whether the problem still occurs. If it works, I'll report back.

CeliaShu1024 commented 3 months ago

Resolved. After increasing the RAM it runs successfully, and the output files match those from the Windows test.