opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.19k stars 835 forks

Bug after enabling table recognition #475

Open Maple0709 opened 3 weeks ago

Maple0709 commented 3 weeks ago

Description of the bug

After setting "is_table_recog_enable" to true in magic_pdf.json, the bug below appears. With the value set to false, the problem does not occur.
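To compare the two settings quickly, the flag can be flipped from a script. This is only a minimal sketch: the report does not show where "is_table_recog_enable" sits inside magic_pdf.json, so the hypothetical helper `set_key_everywhere` searches the whole structure, and the config path passed to `toggle_table_recognition` is whatever location your installation uses.

```python
import json
from pathlib import Path


def set_key_everywhere(node, key, value):
    """Set every occurrence of `key` in nested dicts/lists; return how many were set."""
    changed = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                changed += 1
            else:
                changed += set_key_everywhere(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            changed += set_key_everywhere(item, key, value)
    return changed


def toggle_table_recognition(config_path, enabled):
    """Rewrite the JSON config with is_table_recog_enable set to `enabled`."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    count = set_key_everywhere(config, "is_table_recog_enable", enabled)
    if count == 0:
        # Fail loudly instead of silently writing an unchanged file.
        raise KeyError(f"is_table_recog_enable not found in {path}")
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    return count
```

Calling `toggle_table_recognition(path, False)` before a run and `toggle_table_recognition(path, True)` afterwards reproduces the comparison described above without hand-editing the file.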

How to reproduce the bug

Traceback (most recent call last):
  File "/data/MinerU/pdf_extract_test.py", line 20, in <module>
    pipe = UNIPipe(jso_useful_key, img_writer)
  File "/data/MinerU/magic_pdf/pipe/UNIPipe.py", line 60, in __init__
    self.txt_custom_model = self.model_manager.get_model(ocr=False, show_log=show_log)
  File "/data/MinerU/magic_pdf/model/doc_analyze_by_custom_model.py", line 64, in get_model
    self._models[key] = custom_model_init(ocr=ocr, show_log=show_log)
  File "/data/MinerU/magic_pdf/model/doc_analyze_by_custom_model.py", line 94, in custom_model_init
    custom_model = CustomPEKModel(*model_input)
  File "/data/MinerU/magic_pdf/model/pdf_extract_kit.py", line 146, in __init__
    self.table_model = table_model_init(str(os.path.join(models_dir, self.configs["weights"]["table"])),
  File "/data/MinerU/magic_pdf/model/pdf_extract_kit.py", line 40, in table_model_init
    table_model = StructTableModel(model_path, max_time=max_time, device=device)
  File "/data/MinerU/magic_pdf/model/pek_sub_modules/structeqtable/StructTableModel.py", line 10, in __init__
    self.model = StructTable(self.model_path, self.max_new_tokens, self.max_time).cuda()
  File "/opt/conda/lib/python3.10/site-packages/struct_eqtable/model.py", line 17, in __init__
    self.init_model(model_path)
  File "/opt/conda/lib/python3.10/site-packages/struct_eqtable/model.py", line 27, in init_model
    self.model = AutoModelForVision2Seq.from_pretrained(model_path)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3550, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/pix2struct/modeling_pix2struct.py", line 1567, in __init__
    self.encoder = Pix2StructVisionModel(config.vision_config)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/pix2struct/modeling_pix2struct.py", line 533, in __init__
    self.encoder = Pix2StructVisionEncoder(config)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/pix2struct/modeling_pix2struct.py", line 304, in __init__
    self.layer = nn.ModuleList([Pix2StructVisionLayer(config) for _ in range(config.num_hidden_layers)])
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/pix2struct/modeling_pix2struct.py", line 304, in <listcomp>
    self.layer = nn.ModuleList([Pix2StructVisionLayer(config) for _ in range(config.num_hidden_layers)])
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/pix2struct/modeling_pix2struct.py", line 264, in __init__
    self.pre_mlp_layer_norm = Pix2StructLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
  File "/opt/conda/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 364, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 674, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /opt/conda/lib/python3.10/site-packages/fused_layer_norm_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEENS6_INS2_12MemoryFormatEEE
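The last frames show the failure happening while apex loads its compiled `fused_layer_norm_cuda` extension, and the missing symbol lives in the `at::_ops` namespace of PyTorch. One plausible reading, not confirmed anywhere in this thread, is that the apex extension was built against a different PyTorch build than the one now installed. The sketch below isolates that import so it can be checked without loading MinerU. The helper names `try_import` and `diagnose` are made up for illustration.

```python
import importlib


def try_import(module_name):
    """Return (True, None) if the module imports, else (False, error text)."""
    try:
        importlib.import_module(module_name)
        return True, None
    except Exception as exc:
        # An undefined-symbol failure in a .so file surfaces as ImportError.
        return False, f"{type(exc).__name__}: {exc}"


def diagnose():
    """Report whether torch and apex's compiled extension load in this environment."""
    # torch is imported first so that its shared libraries are already loaded
    # when the extension module is tried.
    for name in ("torch", "fused_layer_norm_cuda"):
        ok, err = try_import(name)
        print(f"{name}: {'ok' if ok else err}")
```

Running `diagnose()` in the failing environment should show the same ImportError for `fused_layer_norm_cuda` if the extension itself is the problem, independent of the table model.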

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

papayalove commented 3 weeks ago

Downgrade Cython to version 0.29.36.

Maple0709 commented 3 weeks ago

The Cython version is already 0.29.36.

Juan-hwt commented 2 weeks ago

A question: Markdown does not natively support merged cells. How are tables with merged cells recognized?

Andy197527 commented 2 weeks ago

> A question: Markdown does not natively support merged cells. How are tables with merged cells recognized?

Has this been solved?

Jalen-Zhong commented 2 weeks ago

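> A question: Markdown does not natively support merged cells. How are tables with merged cells recognized?

Embedding HTML solves the problem of merged cells in multi-header tables. In my own tests, the method from #360 reached over 80% accuracy. Note also that the official table recognition method outputs LaTeX. The drawback of the above method is that it does not support tables with watermarks.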
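To illustrate the HTML-embedding idea, here is a minimal sketch of how merged header cells can be expressed with the standard `rowspan` and `colspan` attributes, which Markdown pipe tables cannot represent. The helper `render_table` and the sample data are hypothetical and are not taken from #360 or from MinerU.

```python
from html import escape


def render_table(rows):
    """Render rows of (text, rowspan, colspan) cells as an HTML table string."""
    lines = ["<table>"]
    for row in rows:
        cells = []
        for text, rowspan, colspan in row:
            attrs = ""
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            cells.append(f"<td{attrs}>{escape(text)}</td>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical two-level header: "Score" spans two sub-columns,
# and "Name" spans both header rows.
example = [
    [("Name", 2, 1), ("Score", 1, 2)],
    [("Math", 1, 1), ("Physics", 1, 1)],
    [("Alice", 1, 1), ("90", 1, 1), ("85", 1, 1)],
]
html_table = render_table(example)
```

The resulting string can be pasted directly into a Markdown document, since most Markdown renderers pass inline HTML through unchanged.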

Andy197527 commented 2 weeks ago

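> A question: Markdown does not natively support merged cells. How are tables with merged cells recognized?

> Embedding HTML solves the problem of merged cells in multi-header tables. In my own tests, the method from #360 reached over 80% accuracy. Note also that the official table recognition method outputs LaTeX. The drawback of the above method is that it does not support tables with watermarks.

Can you recommend a good recognition model for complex tables?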

papayalove commented 2 weeks ago

> The Cython version is already 0.29.36.

Please post your environment list so we can take a look.

Jalen-Zhong commented 2 weeks ago

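> A question: Markdown does not natively support merged cells. How are tables with merged cells recognized?

> Embedding HTML solves the problem of merged cells in multi-header tables. In my own tests, the method from #360 reached over 80% accuracy. Note also that the official table recognition method outputs LaTeX. The drawback of the above method is that it does not support tables with watermarks.

> Can you recommend a good recognition model for complex tables?

I have the same need and have not found a better solution so far. I suggest trying the method from #360 or a multimodal large model. In my tests, some multimodal large models cannot recognize tables with merged multi-level headers, even when asked to return HTML.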

Maple0709 commented 2 weeks ago

> The Cython version is already 0.29.36.

> Please post your environment list so we can take a look.

The environment list is as follows: absl-py 2.0.0 accelerate 0.24.1 adaseq 0.6.6 addict 2.4.0 aiohttp 3.9.1 aiosignal 1.3.1 albucore 0.0.12 albumentations 1.4.11 aliyun-python-sdk-core 2.14.0 aliyun-python-sdk-kms 2.16.2 altair 5.3.0 annotated-types 0.6.0 antlr4-python3-runtime 4.9.3 anyio 3.7.1 apex 0.1 appdirs 1.4.4 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 astor 0.8.1 asttokens 2.4.1 astunparse 1.6.3 async-lru 2.0.4 async-timeout 4.0.3 attrdict 2.0.1 attrs 23.1.0 audioread 3.0.1 auto-gptq 0.5.1+cu118 av 11.0.0 Babel 2.13.1 basicsr 1.4.2 bce-python-sdk 0.9.17 beartype 0.16.4 beautifulsoup4 4.12.2 biopython 1.81 bitarray 2.8.3 bitsandbytes 0.41.2.post2 bitstring 4.1.4 black 23.11.0 bleach 6.1.0 blinker 1.8.2 blis 0.7.11 blobfile 2.1.1 bmt-clipit 1.0 boltons 23.0.0 boto3 1.33.8 botocore 1.33.8 braceexpand 0.1.7 Brotli 1.1.0 brotlipy 0.7.0 cachetools 5.3.2 catalogue 2.0.10 certifi 2023.7.22 cffi 1.15.1 cfgv 3.4.0 charset-normalizer 2.0.4 chumpy 0.70 cityscapesScripts 2.2.2 click 8.1.7 clip 1.0 cloudpathlib 0.16.0 cloudpickle 3.0.0 colorama 0.4.6 coloredlogs 15.0.1 colorlog 6.8.2 comm 0.2.0 conda 23.9.0 conda-content-trust 0.2.0 conda-libmamba-solver 23.9.1 conda-package-handling 2.2.0 conda_package_streaming 0.9.0 confection 0.1.4 ConfigArgParse 1.7 contextlib2 21.6.0 contourpy 1.2.0 control-ldm 0.0.1 crcmod 1.7 cryptography 41.0.3 cssselect 1.2.0 cssutils 2.11.1 cycler 0.12.1 cymem 2.0.8 Cython 0.29.36 dataclasses 0.6 datasets 2.15.0 ddpm-guided-diffusion 0.0.0 debugpy 1.8.0 decorator 4.4.2 decord 0.6.0 deepspeed 0.12.3 defusedxml 0.7.1 descartes 1.1.0 detectron2 0.6 dgl 1.1.2+cu118 diffusers 0.21.4 dill 0.3.6 distlib 0.3.7 docx2txt 0.8 easydict 1.11 easyrobust 0.2.4 edit-distance 1.0.6 editdistance 0.6.2 einops 0.7.0 embeddings 0.0.8 emoji 2.9.0 et-xmlfile 1.1.0 eva-decord 0.6.1 eval_type_backport 0.2.0 evaluate 0.4.2 exceptiongroup 1.1.3 executing 2.0.1 expecttest 0.1.6 face-alignment 1.4.1 fairscale 0.4.13 fairseq 0.12.2 fast-langdetect 0.2.0 fastai 2.7.13 fastapi 0.104.1 
fastcore 1.5.29 fastdownload 0.0.7 fastjsonschema 2.19.0 fastprogress 1.0.3 fasttext 0.9.2 fasttext-wheel 0.9.2 ffmpeg 1.4 ffmpeg-python 0.2.0 filelock 3.13.1 fire 0.5.0 flake8 6.1.0 Flask 3.0.3 flask-babel 4.0.0 flatbuffers 23.5.26 fonttools 4.44.3 fqdn 1.5.1 frozenlist 1.4.0 fsspec 2023.10.0 ftfy 6.2.0 funasr 0.8.7 funtextprocessing 0.1.1 future 0.18.3 fvcore 0.1.5.post20221221 gast 0.5.4 gekko 1.0.6 gitdb 4.0.11 GitPython 3.1.43 google-auth 2.23.4 google-auth-oauthlib 1.0.0 google-pasta 0.2.0 greenlet 3.0.1 grpcio 1.59.2 h11 0.14.0 h5py 3.10.0 hdbscan 0.8.33 hjson 3.1.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.19.4 humanfriendly 10.0 hydra-core 1.3.2 HyperPyYAML 1.2.2 identify 2.5.32 idna 3.4 imageio 2.33.0 imageio-ffmpeg 0.4.9 imgaug 0.4.0 importlib-metadata 6.8.0 inflect 7.0.0 iniconfig 2.0.0 iopath 0.1.9 ipdb 0.13.13 ipykernel 6.26.0 ipython 8.17.2 isoduration 20.11.0 isort 5.12.0 itsdangerous 2.2.0 jaconv 0.3.4 jamo 0.4.1 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.3 jmespath 0.10.0 joblib 1.3.2 json-tricks 3.17.3 json5 0.9.14 jsonpatch 1.32 jsonplus 0.8.0 jsonpointer 2.1 jsonschema 4.20.0 jsonschema-specifications 2023.11.1 jupyter_client 8.6.0 jupyter_core 5.5.0 jupyter-events 0.9.0 jupyter-lsp 2.2.1 jupyter_server 2.10.1 jupyter_server_terminals 0.4.4 jupyterlab 4.2.4 jupyterlab_pygments 0.3.0 jupyterlab_server 2.27.3 kaldiio 2.18.0 kantts 1.0.1 keras 2.14.0 kiwisolver 1.4.5 kornia 0.7.0 kwsbp 0.0.6 langcodes 3.3.0 langdetect 1.0.9 lap 0.4.0 lazy_loader 0.4 libclang 16.0.6 libmambapy 1.5.1 librosa 0.10.1 lightning-utilities 0.10.0 llvmlite 0.41.1 lmdb 1.4.1 loguru 0.7.2 lpips 0.1.4 lxml 4.9.3 lyft-dataset-sdk 0.0.8 magic-pdf 0.7.0b1 Markdown 3.5.1 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.1 matplotlib-inline 0.1.6 mccabe 0.7.0 mdurl 0.1.2 megatron-util 1.3.2 MinDAEC 0.0.2 minio 7.2.7 mir-eval 0.7 mistune 3.0.2 ml-collections 0.1.1 ml-dtypes 0.2.0 mmcls 0.25.0 mmcv-full 1.7.0 mmdet 2.28.2 mmdet3d 1.0.0a1 mmsegmentation 0.30.0 mock 
5.1.0 modelscope 1.10.0 more-itertools 10.3.0 moviepy 1.0.3 mpi4py 3.1.5 mpmath 1.3.0 ms-swift 1.4.0 msgpack 1.0.7 multidict 6.0.4 multiprocess 0.70.14 MultiScaleDeformableAttention 1.0 murmurhash 1.0.10 mypy-extensions 1.0.0 nbclient 0.9.0 nbconvert 7.11.0 nbformat 5.9.2 nerfacc 0.2.2 nest-asyncio 1.5.8 networkx 3.2.1 ninja 1.11.1.1 nltk 3.8.1 nodeenv 1.8.0 notebook_shim 0.2.3 numba 0.58.1 numpy 1.26.4 nuscenes-devkit 1.1.11 nvdiffrast 0.3.1 nvidia-cublas-cu11 11.11.3.6 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu11 11.8.87 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu11 11.8.89 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu11 11.8.89 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu11 8.7.0.84 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu11 10.9.0.58 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu11 10.3.0.86 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu11 11.4.1.48 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu11 11.7.5.86 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu11 2.20.5 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.82 nvidia-nvtx-cu11 11.8.86 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 omegaconf 2.3.0 onnx 1.15.0 onnxruntime 1.16.3 onnxsim 0.4.35 open-clip-torch 2.23.0 opencv-contrib-python 4.6.0.66 opencv-python 4.6.0.66 opencv-python-headless 4.10.0.84 openpyxl 3.1.5 opt-einsum 3.3.0 optimum 1.14.1 oss2 2.18.3 overrides 7.4.0 packaging 23.1 paddleocr 2.7.3 paddlepaddle 3.0.0b1 pai-easycv 0.11.6 paint-ldm 0.0.0 pandas 2.2.2 pandocfilters 1.5.0 panopticapi 0.1 parso 0.8.3 pathspec 0.11.2 pdf2docx 0.5.8 pdf2image 1.17.0 pdfminer.six 20231228 peft 0.6.2 pexpect 4.8.0 pickleshare 0.7.5 pillow 10.2.0 pip 24.2 platformdirs 4.0.0 plotly 5.18.0 pluggy 1.0.0 plyfile 1.0.2 pointnet2 0.0.0 pooch 1.8.0 portalocker 2.8.2 pre-commit 3.5.0 premailer 3.10.0 preshed 3.0.9 prettytable 3.9.0 proglog 0.1.10 prometheus-client 0.19.0 prompt-toolkit 3.0.41 protobuf 3.20.3 psutil 5.9.6 ptflops 0.7.1.2 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 
py-sound-connect 0.2.1 pyarrow 14.0.1 pyarrow-hotfix 0.6 pyasn1 0.5.0 pyasn1-modules 0.3.0 pybind11 2.11.1 pyclipper 1.3.0.post5 pycocoevalcap 1.2 pycocotools 2.0.7 pycodestyle 2.11.1 pycosat 0.6.6 pycparser 2.21 pycryptodome 3.19.0 pycryptodomex 3.19.0 pydantic 2.8.2 pydantic_core 2.20.1 pydeck 0.9.1 pyDeprecate 0.3.2 pydot 1.4.2 pyflakes 3.1.0 Pygments 2.16.1 PyMCubes 0.1.4 PyMuPDF 1.24.9 PyMuPDFb 1.24.9 PyMySQL 1.1.1 pynini 2.1.5 pynndescent 0.5.11 pynvml 11.5.0 pyOpenSSL 23.2.0 pypandoc 1.13 pyparsing 3.1.1 pypdfium2 4.30.0 pyquaternion 0.9.9 PySocks 1.7.1 pysptk 0.1.18 pytest 7.4.3 pythainlp 4.0.2 python-crfsuite 0.9.9 python-dateutil 2.8.2 python-docx 1.1.2 python-dotenv 1.0.0 python-json-logger 2.0.7 pytorch-lightning 1.7.7 pytorch-metric-learning 2.3.0 pytorch-wavelets 1.3.0 pytorch-wpe 0.0.1 pytorch3d 0.7.5 pytz 2023.3.post1 pyvi 0.1.1 PyWavelets 1.5.0 PyYAML 6.0.1 pyzmq 25.1.1 qudida 0.0.4 rapidfuzz 3.9.4 rarfile 4.2 ray 2.8.0 referencing 0.31.0 regex 2023.10.3 requests 2.31.0 requests-oauthlib 1.3.1 resampy 0.4.2 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.1 robust-downloader 0.0.2 rotary-embedding-torch 0.4.0 rouge 1.0.1 rouge-score 0.0.4 rpds-py 0.13.1 rsa 4.9 ruamel.yaml 0.18.5 ruamel.yaml.clib 0.2.8 s3transfer 0.8.2 sacrebleu 2.3.2 sacremoses 0.1.1 safetensors 0.4.3 scikit-image 0.24.0 scikit-learn 1.3.2 scipy 1.11.3 seaborn 0.13.0 Send2Trash 1.8.2 sentencepiece 0.1.99 seqeval 1.2.2 setuptools 68.0.0 Shapely 1.8.4 shotdetect-scenedetect-lgss 0.0.4 simplejson 3.19.2 six 1.16.0 sklearn-crfsuite 0.3.6 smart-open 6.4.0 smmap 5.0.1 smplx 0.1.28 sniffio 1.3.0 sortedcontainers 2.4.0 soundfile 0.12.1 soupsieve 2.5 sox 1.4.1 soxr 0.3.7 spacy 3.7.2 spacy-legacy 3.0.12 spacy-loggers 1.0.5 speechbrain 0.5.16 srsly 2.4.8 stack-data 0.6.3 stanza 1.7.0 starlette 0.27.0 streamlit 1.36.0 streamlit-drawable-canvas 0.9.3 struct-eqtable 0.1.0 subword-nmt 0.3.8 sympy 1.12 tabulate 0.9.0 taming-transformers-rom1504 0.0.6 tb-nightly 2.16.0a20231127 tenacity 
8.2.3 tensorboard 2.14.1 tensorboard-data-server 0.7.2 tensorboardX 2.6.2.2 tensordict 0.2.1 tensorflow 2.14.0 tensorflow-estimator 2.14.0 tensorflow-io-gcs-filesystem 0.34.0 termcolor 2.4.0 terminado 0.18.0 terminaltables 3.1.10 text2sql-lgesql 1.3.0 tf-keras-nightly 2.16.0.dev2023112710 tf-slim 1.1.0 thinc 8.2.1 thop 0.1.1.post2209072238 threadpoolctl 3.2.0 tifffile 2023.9.26 tiktoken 0.5.1 timm 0.9.16 tinycss2 1.2.1 tinycudann 1.7+cu118 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 toolz 0.12.1 torch 2.3.1+cu118 torch-complex 0.4.3 torch-scatter 2.1.2 torchaudio 2.1.0+cu118 torchdata 0.7.0 torchmetrics 0.11.4 torchsde 0.2.6 torchsummary 1.5.1 torchtext 0.18.0 torchvision 0.18.1+cu118 tornado 6.3.3 tqdm 4.65.0 traitlets 5.13.0 trampoline 0.1.2 transformers 4.40.0 transformers-stream-generator 0.0.4 trimesh 2.35.39 triton 2.3.1 truststore 0.8.0 ttsfrd 0.2.1 typeguard 2.13.3 typer 0.9.0 types-python-dateutil 2.8.19.14 typing 3.7.4.3 typing_extensions 4.9.0 tzdata 2023.3 ujson 5.8.0 ultralytics 8.2.64 ultralytics-thop 2.0.0 umap 0.1.1 umap-learn 0.5.5 unicodedata2 15.1.0 unicore 1.2.1 Unidecode 1.3.7 unimernet 0.1.6 uri-template 1.3.0 urllib3 1.26.16 utils 1.0.1 uvicorn 0.24.0.post1 uvloop 0.19.0 videofeatures-clipit 1.0 virtualenv 20.25.0 visualdl 2.5.3 vllm 0.2.1+cu118torch2.1 waitress 3.0.0 Wand 0.6.13 wasabi 1.1.2 watchdog 4.0.1 watchfiles 0.21.0 wcwidth 0.2.12 weasel 0.3.4 webcolors 1.13 webdataset 0.2.86 webencodings 0.5.1 websocket-client 1.6.4 websockets 12.0 Werkzeug 3.0.1 wget 3.2 wheel 0.41.2 wordninja 2.0.0 wrapt 1.14.1 xformers 0.0.22.post7+cu118 xtcocotools 1.14 xxhash 3.4.1 yacs 0.1.8 yapf 0.30.0 yarl 1.9.3 zhconv 1.4.3 zipp 3.17.0 zstandard 0.19.0

myhloli commented 2 weeks ago

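@Maple0709 Your dependency list contains 551 packages. In our tests, installing into a clean environment should yield only about 170 packages. We suspect that other packages installed in this environment have broken the runtime. We suggest creating a new conda environment and installing from scratch.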
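The package-count comparison above can be done mechanically. The following is a minimal sketch that parses a flattened "name version name version ..." list, like the one posted earlier in this thread, and reports which packages are not in a baseline. The function names and the baseline set are hypothetical; a real baseline would come from listing the packages of a freshly created environment.

```python
def parse_flat_package_list(text):
    """Parse 'name version name version ...' into a dict of name -> version."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating name/version tokens")
    return dict(zip(tokens[0::2], tokens[1::2]))


def extra_packages(installed, baseline):
    """Return the names present in `installed` but missing from `baseline`, sorted."""
    return sorted(set(installed) - set(baseline))


# A few entries copied from the list posted in this thread.
sample = "apex 0.1 torch 2.3.1+cu118 torchaudio 2.1.0+cu118 transformers 4.40.0"
installed = parse_flat_package_list(sample)

# Hypothetical baseline, for illustration only.
baseline = {"torch", "transformers"}
extras = extra_packages(installed, baseline)
```

With the full list as input, `len(installed)` gives the package count, and `extras` points at the packages worth checking first.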