opendatalab / PDF-Extract-Kit

A Comprehensive Toolkit for High-Quality PDF Content Extraction
https://pdf-extract-kit.readthedocs.io/zh-cn/latest/index.html
GNU Affero General Public License v3.0
5.27k stars 357 forks

Error when running on Windows 10 #21

Closed: borpubi closed this 2 months ago

borpubi commented 3 months ago
>python pdf_extract.py --pdf ./pdf/第一单元.pdf
Namespace(pdf='./pdf/第一单元.pdf', output='output', vis=False, render=False)
2024-07-15 16:43:01
Started!
Traceback (most recent call last):
  File "D:\PDF-Extract-Kit\lib\site-packages\transformers\modeling_utils.py", line 533, in load_state_dict
    return torch.load(
  File "D:\PDF-Extract-Kit\lib\site-packages\torch\serialization.py", line 1004, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "D:\PDF-Extract-Kit\lib\site-packages\torch\serialization.py", line 456, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PDF-Extract-Kit\lib\site-packages\transformers\modeling_utils.py", line 542, in load_state_dict
    if f.read(7) == "version":
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 64: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PDF-Extract-Kit\PDF-Extract-Kit\pdf_extract.py", line 93, in <module>
    mfr_model, mfr_vis_processors = mfr_model_init(model_configs['model_args']['mfr_weight'], device=device)
  File "D:\PDF-Extract-Kit\PDF-Extract-Kit\pdf_extract.py", line 41, in mfr_model_init
    model = task.build_model(cfg)
  File "D:\PDF-Extract-Kit\lib\site-packages\unimernet\tasks\base_task.py", line 33, in build_model
    return model_cls.from_config(model_config)
  File "D:\PDF-Extract-Kit\lib\site-packages\unimernet\models\unimernet\unimernet.py", line 102, in from_config
    model = cls(
  File "D:\PDF-Extract-Kit\lib\site-packages\unimernet\models\unimernet\unimernet.py", line 35, in __init__
    self.model = DonutEncoderDecoder(
  File "D:\PDF-Extract-Kit\lib\site-packages\unimernet\models\unimernet\encoder_decoder.py", line 714, in __init__
    self.model = CustomVisionEncoderDecoderModel.from_pretrained(model_name, config=self.config, length_aware=length_aware)
  File "D:\PDF-Extract-Kit\lib\site-packages\transformers\models\vision_encoder_decoder\modeling_vision_encoder_decoder.py", line 359, in from_pretrained
    return super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
  File "D:\PDF-Extract-Kit\lib\site-packages\transformers\modeling_utils.py", line 3481, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "D:\PDF-Extract-Kit\lib\site-packages\transformers\modeling_utils.py", line 554, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'models/MFR/UniMERNet\pytorch_model.bin' at 'models/MFR/UniMERNet\pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
borpubi commented 3 months ago

I fixed the error above, but now I get this instead: no output is produced.

python pdf_extract.py --pdf pdf/第一单元.pdf
Namespace(pdf='pdf/第一单元.pdf', output='output', vis=False, render=False)
2024-07-16 14:06:25
Started!
CustomVisionEncoderDecoderModel init
lylllllo commented 3 months ago

How did you solve the first problem?

borpubi commented 3 months ago

"How did you solve the first problem?"

The downloaded model file was bad. Deleting it and downloading it again fixed the problem.
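
For anyone hitting the same traceback, here is a rough sketch of how one might check the weights file before deleting it. It is not part of PDF-Extract-Kit; the function name `diagnose_checkpoint` and the status strings are made up for illustration. It assumes the checkpoint was saved in PyTorch's zip-based format, so an incomplete download shows up as a file that `zipfile` cannot read (the "failed finding central directory" message points the same way). It also checks for a Git LFS pointer file, which is a small text stub that begins with "version" and is left behind when a repository is cloned without Git LFS.

```python
import zipfile
from pathlib import Path

# First bytes of a Git LFS pointer file (a text stub, not the real weights).
LFS_PREFIX = b"version https://git-lfs"


def diagnose_checkpoint(path):
    """Return a short status string for a checkpoint file (hypothetical helper).

    "missing"     - the file does not exist
    "lfs-pointer" - the file is a Git LFS pointer, not the actual weights
    "ok"          - the file is a readable zip archive
    "not-a-zip"   - anything else, e.g. a truncated download
                    (a legacy non-zip checkpoint would also land here)
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as f:
        head = f.read(len(LFS_PREFIX))
    if head == LFS_PREFIX:
        return "lfs-pointer"
    # is_zipfile looks for the end-of-central-directory record,
    # which is absent when a download was cut off.
    if zipfile.is_zipfile(str(p)):
        return "ok"
    return "not-a-zip"


if __name__ == "__main__":
    # Path taken from the error message above; adjust to your setup.
    print(diagnose_checkpoint("models/MFR/UniMERNet/pytorch_model.bin"))
```

If this reports "lfs-pointer" or "not-a-zip", deleting the file and downloading it again, as described above, is the likely fix. A result of "ok" only means the archive structure is intact; it does not verify the contents against a checksum.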