opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.24k stars 840 forks

Segmentation fault after upgrading magic-pdf to the latest version #543

Open randydl opened 1 week ago

randydl commented 1 week ago

Description of the bug

(data) (base) randy@3080Ti:/mnt/projects/settings$ magic-pdf -p small_ocr.pdf
2024-09-04 00:07:47.984 INFO magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 8, cid_chars_radio: 0.0
2024-09-04 00:07:47.985 WARNING magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: False, by_invalid_chars: True
WARNING ⚠️ Ultralytics settings reset to default values. This may be due to a possible problem with your settings or a recent ultralytics package update.
View settings with 'yolo settings' or at '/home/randy/.config/Ultralytics/settings.yaml'
Update settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
2024-09-04 00:07:51.653 INFO magic_pdf.model.pdf_extract_kit:init:121 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: True, apply_table: False
2024-09-04 00:07:51.653 INFO magic_pdf.model.pdf_extract_kit:init:129 - using device: cuda
2024-09-04 00:07:51.653 INFO magic_pdf.model.pdf_extract_kit:init:131 - using models_dir: /mnt/models/PDF-Extract-Kit/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[09/04 00:07:57 detectron2]: Rank of current process: 0. World size: 1
[09/04 00:07:57 detectron2]: Environment info:
sys.platform linux
Python 3.10.14 packaged by conda-forge (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
numpy 1.26.4
detectron2 0.6 @/mnt/miniconda3/envs/data/lib/python3.10/site-packages/detectron2
Compiler GCC 11.4
CUDA compiler not available
DETECTRON2_ENV_MODULE
PyTorch 2.3.1+cu121 @/mnt/miniconda3/envs/data/lib/python3.10/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0 NVIDIA GeForce RTX 3080 Ti (arch=8.6)
Driver version 535.183.01
CUDA_HOME /usr/local/cuda
Pillow 10.4.0
torchvision 0.18.1+cu121 @/mnt/miniconda3/envs/data/lib/python3.10/site-packages/torchvision
torchvision arch flags 5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 9.0
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.6.0

PyTorch built with:

[09/04 00:07:57 detectron2]: Command line arguments: {'config_file': '/mnt/miniconda3/envs/data/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', '/mnt/models/PDF-Extract-Kit/models/Layout/model_final.pth']}
[09/04 00:07:57 detectron2]: Contents of args.config_file=/mnt/miniconda3/envs/data/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layoutlmv3_base_inference.yaml:
AUG:
  DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: false
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[09/04 00:07:58 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /mnt/models/PDF-Extract-Kit/models/Layout/model_final.pth ...
[09/04 00:07:58 fvcore.common.checkpoint]: [Checkpointer] Loading from /mnt/models/PDF-Extract-Kit/models/Layout/model_final.pth ...
2024-09-04 00:07:59.048 | INFO | magic_pdf.model.pdf_extract_kit:init:159 - DocAnalysis init done!
2024-09-04 00:07:59.049 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:98 - model init cost: 11.06343150138855


C++ Traceback (most recent call last):

0 at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, std::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, c10::SymInt)
1 at::native::conv2d_symint(at::Tensor const&, at::Tensor const&, std::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, c10::SymInt)
2 at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, std::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, c10::SymInt)
3 at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, std::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, c10::SymInt)
4 at::native::convolution(at::Tensor const&, at::Tensor const&, std::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long)
5 at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, std::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, c10::SymInt, bool, bool, bool, bool)
6 at::native::_convolution(at::Tensor const&, at::Tensor const&, std::optional const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, bool, c10::ArrayRef, long, bool, bool, bool, bool)
7 at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, c10::SymInt, bool, bool, bool)
8 at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, bool)


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1725379679 (unix time) try "date -d @1725379679" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x20000002ef4) received by PID 13766 (TID 0x7fcbb9641740) from PID 12020 ]

Segmentation fault
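
The TimeInfo line suggests running `date -d @1725379679`; the same conversion can be sketched in Python to line the abort up with the log timestamps. The UTC+8 offset below is an assumption, since the thread does not state the machine's time zone.

```python
from datetime import datetime, timedelta, timezone

ABORT_TS = 1725379679  # unix time copied from the TimeInfo line

# Interpret the timestamp as UTC first.
utc_time = datetime.fromtimestamp(ABORT_TS, tz=timezone.utc)

# Assumption: the reporter's machine runs at UTC+8 (not stated in the thread).
local_time = utc_time.astimezone(timezone(timedelta(hours=8)))

print(utc_time.isoformat())    # 2024-09-03T16:07:59+00:00
print(local_time.isoformat())  # 2024-09-04T00:07:59+08:00
```

Under that assumption the abort falls in the same second as the `model init cost` log line (00:07:59), which would place the crash right after model initialization finished.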

How to reproduce the bug

The environment was configured exactly as the tutorial describes, but the error still occurs.
(data) (base) randy@3080Ti:/mnt/projects/settings$ pip list
Package Version


absl-py 2.1.0
aiohappyeyeballs 2.4.0
aiohttp 3.10.5
aiosignal 1.3.1
albucore 0.0.14
albumentations 1.4.14
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.4.0
astor 0.8.1
async-timeout 4.0.3
attrdict 2.0.1
attrs 24.2.0
babel 2.16.0
bce-python-sdk 0.9.21
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
boto3 1.35.10
botocore 1.35.10
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.5.0
certifi 2024.8.30
cffi 1.17.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorlog 6.8.2
contourpy 1.3.0
cryptography 43.0.0
cssselect 1.2.0
cssutils 2.11.1
cycler 0.12.1
Cython 3.0.11
datasets 2.21.0
decorator 5.1.1
detectron2 0.6
dill 0.3.8
et-xmlfile 1.1.0
eva-decord 0.6.1
eval_type_backport 0.2.0
evaluate 0.4.2
exceptiongroup 1.2.2
fairscale 0.4.13
fast-langdetect 0.2.0
fasttext-wheel 0.9.2
filelock 3.15.4
fire 0.6.0
Flask 3.0.3
flask-babel 4.0.0
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.6.1
ftfy 6.2.3
future 1.0.0
fvcore 0.1.5.post20221221
grpcio 1.66.1
h11 0.14.0
httpcore 1.0.5
httpx 0.27.2
huggingface-hub 0.24.6
hydra-core 1.3.2
idna 3.8
imageio 2.35.1
imgaug 0.4.0
iopath 0.1.9
itsdangerous 2.2.0
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
kiwisolver 1.4.5
langdetect 1.0.9
lazy_loader 0.4
lmdb 1.5.1
loguru 0.7.2
lxml 5.3.0
magic-pdf 0.7.1
Markdown 3.7
MarkupSafe 2.1.5
matplotlib 3.9.2
more-itertools 10.4.0
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mypy-extensions 1.0.0
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu11 11.11.3.6
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.8.87
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.7.0.84
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.3.0.86
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.1.48
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.5.86
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.19.3
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.68
nvidia-nvtx-cu11 11.8.86
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
opencv-contrib-python 4.6.0.66
opencv-python 4.6.0.66
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opt-einsum 3.3.0
packaging 24.1
paddleocr 2.7.3
paddlepaddle 3.0.0b1
paddlepaddle-gpu 3.0.0b1
pandas 2.2.2
pathspec 0.12.1
pdf2docx 0.5.8
pdfminer.six 20231228
pillow 10.4.0
pip 24.2
platformdirs 4.2.2
portalocker 2.10.1
premailer 3.10.0
protobuf 5.28.0
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 17.0.0
pybind11 2.13.5
pyclipper 1.3.0.post5
pycocotools 2.0.8
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
PyMuPDF 1.24.10
PyMuPDFb 1.24.10
pypandoc 1.13
pyparsing 3.1.4
python-dateutil 2.9.0.post0
python-docx 1.1.2
pytz 2024.1
PyYAML 6.0.2
rapidfuzz 3.9.7
rarfile 4.2
regex 2024.7.24
requests 2.32.3
robust-downloader 0.0.2
s3transfer 0.10.2
safetensors 0.4.4
scikit-image 0.24.0
scikit-learn 1.5.1
scipy 1.14.1
seaborn 0.13.2
setuptools 73.0.1
shapely 2.0.6
six 1.16.0
sniffio 1.3.1
soupsieve 2.6
struct-eqtable 0.1.0
sympy 1.13.2
tabulate 0.9.0
tensorboard 2.17.1
tensorboard-data-server 0.7.2
termcolor 2.4.0
threadpoolctl 3.5.0
tifffile 2024.8.30
timm 0.9.16
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.1
torchtext 0.18.0
torchvision 0.18.1
tqdm 4.66.5
transformers 4.40.0
triton 2.3.1
typing_extensions 4.12.2
tzdata 2024.1
ultralytics 8.2.87
ultralytics-thop 2.0.6
unimernet 0.1.6
urllib3 2.2.2
visualdl 2.5.3
Wand 0.6.13
wcwidth 0.2.13
webdataset 0.2.100
Werkzeug 3.0.4
wheel 0.44.0
wordninja 2.0.0
xxhash 3.5.0
yacs 0.1.8
yarl 1.9.7
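
The list above contains both `-cu11` and `-cu12` builds of several NVIDIA libraries, including cuDNN. A minimal sketch for spotting such pairs is shown here; the helper name `mixed_cuda_builds` is hypothetical, and the sample names are a subset copied from the list. Whether this mix is related to the segmentation fault is not established in this thread.

```python
import re
from collections import defaultdict

# Subset of package names copied from the pip list above.
INSTALLED = [
    "nvidia-cublas-cu11", "nvidia-cublas-cu12",
    "nvidia-cudnn-cu11", "nvidia-cudnn-cu12",
    "nvidia-nvjitlink-cu12",
    "torch", "paddlepaddle-gpu",
]

def mixed_cuda_builds(names):
    """Return NVIDIA libraries that are installed for more than one CUDA suffix."""
    suffixes = defaultdict(set)
    for name in names:
        # Split e.g. "nvidia-cudnn-cu12" into ("nvidia-cudnn", "cu12").
        match = re.fullmatch(r"(nvidia-[a-z0-9-]+)-(cu\d+)", name)
        if match:
            suffixes[match.group(1)].add(match.group(2))
    return {lib: sorted(found) for lib, found in suffixes.items() if len(found) > 1}

print(mixed_cuda_builds(INSTALLED))
```

To scan a live environment instead of a hard-coded list, the names could be collected with `importlib.metadata.distributions()`, lowercasing each `dist.metadata["Name"]` before passing them in.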

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.7.x

Device mode

cuda

randydl commented 1 week ago

pip install -U magic-pdf[full] \
    "/nas_data/userdata/tools/detectron2-0.6-cp310-cp310-linux_x86_64.whl" \
    "/nas_data/userdata/tools/paddlepaddle_gpu-3.0.0b1-cp310-cp310-linux_x86_64.whl"

The .whl packages were downloaded from the website given in the installation tutorial.

randydl commented 1 week ago

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     On  |   00000000:34:00.0 Off |                    0 |
|  0%   40C    P0             79W /  300W |    3370MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     On  |   00000000:35:00.0 Off |                    0 |
|  0%   40C    P0             80W /  300W |    3370MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

bingyunsky commented 1 week ago

Same problem here. It occurs after running python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/