opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.22k stars, 838 forks

"Illegal hardware instruction" error when running magic-pdf on macOS Sonoma with M1 chip #273

Open carllx opened 1 month ago

carllx commented 1 month ago

Description of the bug

When I run the magic-pdf command on macOS Sonoma (14.5) with an Apple M1 chip, I get an "illegal hardware instruction" error. This happens even though I set up the environment according to the installation instructions.

How to reproduce the bug

Set up the environment:

conda create -n MinerU python=3.10
pip install 'magic-pdf[full]==0.6.2b1'
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
git lfs clone https://huggingface.co/wanderkid/PDF-Extract-Kit

Configure magic-pdf.json:

{
    "bucket_info":{
        "bucket-name-1":["ak", "sk", "endpoint"],
        "bucket-name-2":["ak", "sk", "endpoint"]
    },
    "temp-output-dir":"/Users/usrname/Downloads",
    "models-dir":"/Volumes/SSD/llm/PDF-Extract-Kit/models",
    "device-mode":"mps" // 'cpu' also causes the "illegal hardware instruction" error.
}
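
One thing worth ruling out: standard JSON has no comment syntax, so the `//` note above only works as an annotation for this report. How magic-pdf parses its config is not shown in this thread. As a minimal sketch, assuming a strict parser such as Python's `json` module, a file that still contains the comment would be rejected (the helper name `is_strict_json` is made up for illustration):

```python
import json


def is_strict_json(text: str) -> bool:
    """Return True if text parses as standard JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Shortened stand-ins for the config above, with and without the annotation.
with_comment = '{"device-mode": "mps" // note for the report\n}'
without_comment = '{"device-mode": "mps"}'

print(is_strict_json(with_comment))     # False
print(is_strict_json(without_comment))  # True
```

If the real file has no comment in it, this check passes and the config can be ruled out as the cause.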

Run the command:

magic-pdf pdf-command --pdf "/Users/usrname/Downloads/Anna's Archive.pdf" --inside_model true

Expected behavior: The command should process the PDF file.

Actual behavior: The command fails with an "illegal hardware instruction" error.

2024-08-01 00:19:21.281 | WARNING  | magic_pdf.cli.magicpdf:get_model_json:312 - not found json /Users/usrname/Downloads/Anna’s Archive.json existed
2024-08-01 00:19:22.760 | INFO     | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 10, cid_chars_radio: 0.0
2024-08-01 00:19:22.761 | WARNING  | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: True, by_invalid_chars: True
INFO:datasets:PyTorch version 2.2.2 available.
zsh: illegal hardware instruction  magic-pdf pdf-command --pdf  --inside_model true

System information:

macOS: Sonoma 14.5
Hardware: MacBook Air M1, 2020
Python: 3.10
magic-pdf version: 0.6.2b1

Additional attempts:

Added export PYTORCH_ENABLE_MPS_FALLBACK=1 to ~/.zshrc
Ran export PYTORCH_ENABLE_MPS_FALLBACK=1 in the terminal

Neither of these attempts resolved the issue.
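
For context, `PYTORCH_ENABLE_MPS_FALLBACK=1` only tells PyTorch to run operators that are not implemented on MPS on the CPU instead. It does not change which CPU instructions the installed binaries contain, so it would not be expected to help with an "illegal hardware instruction" crash. A small sketch for collecting the relevant facts in one place, assuming torch 1.12 or newer for `torch.backends.mps` (the function name `mps_report` is made up for illustration):

```python
import os
import platform


def mps_report() -> dict:
    """Collect interpreter architecture, the fallback variable, and MPS status."""
    report = {
        "machine": platform.machine(),  # "arm64" for a native Apple silicon Python
        "fallback_env": os.environ.get("PYTORCH_ENABLE_MPS_FALLBACK"),
        "torch": None,
        "mps_built": None,
        "mps_available": None,
    }
    try:
        import torch
    except ImportError:
        return report  # torch is not installed in this environment
    report["torch"] = torch.__version__
    report["mps_built"] = torch.backends.mps.is_built()
    report["mps_available"] = torch.backends.mps.is_available()
    return report


if __name__ == "__main__":
    for key, value in mps_report().items():
        print(f"{key}: {value}")
```

If the interpreter dies while importing torch, before anything is printed, that in itself points at the installed torch build rather than at the MPS settings.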

I would greatly appreciate any assistance in resolving this issue or suggestions for further troubleshooting steps. Thank you for your help!

Operating system

macOS

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

mps

myhloli commented 1 month ago

First, check your Python platform:

import platform
print(platform.machine())

If the platform is arm64, continue. Second, try a new, clean conda environment:

conda create -n cleanMinerU python=3.10
conda activate cleanMinerU
pip install 'magic-pdf[full]==0.6.2b1'
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
pip install torch==2.3.1 torchvision==0.18.1 torchtext==0.18.0
magic-pdf --version

If the version is 0.6.2b1, please try parsing https://github.com/opendatalab/MinerU/blob/master/demo/demo1.pdf in "cpu" mode:

{
    "device-mode":"cpu"
}

magic-pdf pdf-command --pdf demo1.pdf
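
The platform check above reports the architecture of the running interpreter, not of the chip. On an M1, an x86_64 Python runs under Rosetta 2, which at the time of this thread did not execute AVX-family instructions; that is one plausible source of an "illegal hardware instruction" crash, though it is not confirmed here. A minimal sketch that distinguishes the cases, assuming the macOS `sysctl.proc_translated` key is available (the function names are made up for illustration):

```python
import platform
import subprocess
from typing import Optional


def is_rosetta_translated() -> Optional[bool]:
    """True if this process runs under Rosetta 2, False if native,
    None when the answer is unavailable (not macOS, or key missing)."""
    if platform.system() != "Darwin":
        return None
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None
    return out == "1"


def diagnose(machine: str, translated: Optional[bool]) -> str:
    """Turn the two facts into a short recommendation."""
    if machine == "arm64":
        return "native arm64 Python: continue with the clean environment"
    if machine == "x86_64" and translated:
        return "x86_64 Python under Rosetta 2: install an arm64 conda/Python"
    return f"{machine} Python: not a native Apple silicon build"


if __name__ == "__main__":
    print(diagnose(platform.machine(), is_rosetta_translated()))
```

Run it with the same interpreter that runs magic-pdf (for example, inside the activated conda environment), since each environment can carry a different build.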

carllx commented 1 month ago

@myhloli Thank you. I'll try again later.

carllx commented 1 month ago

@myhloli, running your code confirmed that my Python environment was x86_64. I then reinstalled Anaconda, downloading the 64-Bit (Apple silicon) Graphical Installer (704.7M) from the official Anaconda download page.

After reinstalling, I ran the following commands:

pip install 'magic-pdf[full]==0.6.2b1'
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
pip install torch==2.3.1 torchvision==0.18.1 torchtext==0.18.0

After these steps, it appears that magic-pdf is able to start running. However, I am encountering an issue where PaddleOCR does not seem to be utilizing the correct GPU.

magic-pdf pdf-command --pdf "Anna’s Archive.pdf" --inside_model true

[08/02 16:20:52 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /Volumes/T7-carllx2T/llm/PDF-Extract-Kit/models/Layout/model_final.pth ...
[08/02 16:20:52 fvcore.common.checkpoint]: [Checkpointer] Loading from /Volumes/T7-carllx2T/llm/PDF-Extract-Kit/models/Layout/model_final.pth ...
download https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_infer.tar to /Users/yamlam/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer/ch_PP-OCRv4_det_infer.tar
100%|██████████████████████████████████████| 4.89M/4.89M [00:12<00:00, 391kiB/s]
download https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_rec_infer.tar to /Users/yamlam/.paddleocr/whl/rec/ch/ch_PP-OCRv4_rec_infer/ch_PP-OCRv4_rec_infer.tar
100%|██████████████████████████████████████| 11.0M/11.0M [00:14<00:00, 732kiB/s]
download https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar to /Users/yamlam/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer/ch_ppocr_mobile_v2.0_cls_infer.tar
100%|█████████████████████████████████████| 2.19M/2.19M [00:01<00:00, 1.10MiB/s]
2024-08-02 16:21:25.682 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:132 - DocAnalysis init done!
2024-08-02 16:21:25.683 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:92 - model init cost: 68.8973319530487
Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Internal Error (0000000e:Internal Error)
    <AGXG13GFamilyCommandBuffer: 0x31fe20e30>
    label = <none> 
    device = <AGXG13GDevice: 0x12291de00>
        name = Apple M1 
    commandQueue = <AGXG13GFamilyCommandQueue: 0x122938800>
        label = <none> 
        device = <AGXG13GDevice: 0x12291de00>
            name = Apple M1 
    retainedReferences = 1
Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Internal Error (0000000e:Internal Error)
    <AGXG13GFamilyCommandBuffer: 0x1631786c0>
    label = <none> 
    device = <AGXG13GDevice: 0x12291de00>
        name = Apple M1 
    commandQueue = <AGXG13GFamilyCommandQueue: 0x122938800>
        label = <none> 
        device = <AGXG13GDevice: 0x12291de00>
            name = Apple M1 
    retainedReferences = 1
Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Internal Error (0000000e:Internal Error)
    <AGXG13GFamilyCommandBuffer: 0x31b958ad0>
    label = <none> 
    device = <AGXG13GDevice: 0x12291de00>
        name = Apple M1 
    commandQueue = <AGXG13GFamilyCommandQueue: 0x122938800>
        label = <none> 
        device = <AGXG13GDevice: 0x12291de00>
            name = Apple M1 
    retainedReferences = 1

myhloli commented 1 month ago

1. Apple M1 is arm64, not x86_64.
2. We only support macOS in cpu mode; mps is not supported.
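
Following point 2, the config needs "device-mode" set to "cpu". A minimal sketch of making that change programmatically, assuming the config is plain JSON without comments; the file location is passed in as a parameter because it is not stated in this thread, and the helper name `force_cpu_mode` is made up for illustration:

```python
import json
import tempfile
from pathlib import Path


def force_cpu_mode(config_path) -> dict:
    """Set "device-mode" to "cpu" in a magic-pdf style JSON config, in place."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["device-mode"] = "cpu"
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config


# Usage on a throwaway file so no real config is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "magic-pdf.json"
    demo.write_text(
        '{"device-mode": "mps", "models-dir": "/path/to/models"}',
        encoding="utf-8",
    )
    print(force_cpu_mode(demo)["device-mode"])  # cpu
```

All other keys are preserved; only "device-mode" changes.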