opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://mineru.readthedocs.io/
GNU Affero General Public License v3.0
13.76k stars 1.03k forks

'magic-pdf' is not recognized as an internal or external command, operable program or batch file. #333

Closed RGthx closed 3 months ago

RGthx commented 3 months ago

Description of the bug

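I followed the tutorial to set up the virtual environment, downloaded the model files, and configured everything, but the command-line mode does not work, and commands such as magic-pdf --version fail as well. Running the demo.py example works fine, however, and produces the expected md file. Do I need to change something such as environment variables?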

How to reproduce the bug

(MinerU) C:\Users\rgthx\Downloads\MinerU-master\MinerU-master\demo>python demo.py
2024-08-06 00:07:26.246 INFO magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 9, text_len: 33962, cid_chars_radio: 0.00026542408871062874
INFO:datasets:PyTorch version 2.3.1 available.
2024-08-06 00:07:35.988 INFO magic_pdf.model.pdf_extract_kit:init:99 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: False
2024-08-06 00:07:35.989 INFO magic_pdf.model.pdf_extract_kit:init:107 - using device: cpu
2024-08-06 00:07:35.989 INFO magic_pdf.model.pdf_extract_kit:init:109 - using models_dir: D:/Anaconda/envs/MinerU/models
CustomVisionEncoderDecoderModel init
CustomMBartForCausalLM init
CustomMBartDecoder init
[08/06 00:07:45 detectron2]: Rank of current process: 0. World size: 1
[08/06 00:07:45 detectron2]: Environment info:
sys.platform win32
Python 3.10.14 packaged by Anaconda, Inc. (main, May 6 2024, 19:44:50) [MSC v.1916 64 bit (AMD64)]
numpy 1.26.4
detectron2 0.6 @C:\Users\rgthx\AppData\Roaming\Python\Python310\site-packages\detectron2
Compiler MSVC 194033811
CUDA compiler not available
DETECTRON2_ENV_MODULE
PyTorch 2.3.1+cpu @C:\Users\rgthx\AppData\Roaming\Python\Python310\site-packages\torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available No: torch.cuda.is_available() == False
Pillow 10.4.0
torchvision 0.18.1+cpu @C:\Users\rgthx\AppData\Roaming\Python\Python310\site-packages\torchvision
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.6.0

PyTorch built with:

[08/06 00:07:45 detectron2]: Command line arguments: {'config_file': 'C:\Users\rgthx\AppData\Roaming\Python\Python310\site-packages\magic_pdf\resources\model_config\layoutlmv3\layoutlmv3_base_inference.yaml', 'resume': False, 'eval_only': False, 'num_gpus': 1, 'num_machines': 1, 'machine_rank': 0, 'dist_url': 'tcp://127.0.0.1:57823', 'opts': ['MODEL.WEIGHTS', 'D:/Anaconda/envs/MinerU/models\Layout/model_final.pth']}
[08/06 00:07:45 detectron2]: Contents of args.config_file=C:\Users\rgthx\AppData\Roaming\Python\Python310\site-packages\magic_pdf\resources\model_config\layoutlmv3\layoutlmv3_base_inference.yaml:
AUG:
  DETR: true
CACHE_DIR: ~/cache/huggingface
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: false
  NUM_WORKERS: 4
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[08/06 00:07:46 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from D:/Anaconda/envs/MinerU/models\Layout/model_final.pth ...
[08/06 00:07:46 fvcore.common.checkpoint]: [Checkpointer] Loading from d:/Anaconda/envs/MinerU/models\Layout/model_final.pth ...
2024-08-06 00:07:47.665 | INFO | magic_pdf.model.pdf_extract_kit:init:132 - DocAnalysis init done!
2024-08-06 00:07:47.666 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:92 - model init cost: 21.418904542922974
2024-08-06 00:08:02.520 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 14.49

0: 1888x1408 7 embeddings, 5043.3ms
Speed: 27.0ms preprocess, 5043.3ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:08:14.034 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 7, mfr time: 4.14
2024-08-06 00:08:35.911 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 21.88

0: 1888x1408 3 embeddings, 6740.8ms
Speed: 28.5ms preprocess, 6740.8ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:08:46.955 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 3, mfr time: 4.25
2024-08-06 00:09:16.613 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 29.66

0: 1888x1408 18 embeddings, 2 isolateds, 6887.6ms
Speed: 26.2ms preprocess, 6887.6ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:09:34.476 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 20, mfr time: 10.83
2024-08-06 00:09:49.655 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 15.18

0: 1888x1408 32 embeddings, 4 isolateds, 3605.5ms
Speed: 24.7ms preprocess, 3605.5ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:10:25.533 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 36, mfr time: 32.08
2024-08-06 00:10:53.225 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 27.69

0: 1888x1408 7 embeddings, 1 isolated, 5715.7ms
Speed: 30.6ms preprocess, 5715.7ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:11:11.822 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 8, mfr time: 12.79
2024-08-06 00:11:39.676 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 27.85

0: 1888x1408 6 embeddings, 5515.7ms
Speed: 26.1ms preprocess, 5515.7ms inference, 2.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:11:50.873 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 6, mfr time: 5.62
2024-08-06 00:12:18.559 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 27.69

0: 1888x1408 20 embeddings, 5642.8ms
Speed: 27.1ms preprocess, 5642.8ms inference, 2.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:12:41.888 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 20, mfr time: 17.55
2024-08-06 00:12:57.379 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 15.49

0: 1888x1408 7 embeddings, 4560.2ms
Speed: 15.2ms preprocess, 4560.2ms inference, 2.2ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:13:05.562 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 7, mfr time: 3.57
2024-08-06 00:13:22.253 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 16.69

0: 1888x1408 15 embeddings, 4624.4ms
Speed: 26.3ms preprocess, 4624.4ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:13:34.929 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 15, mfr time: 7.94
2024-08-06 00:13:51.540 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 16.61

0: 1888x1408 1 embedding, 6940.7ms
Speed: 25.4ms preprocess, 6940.7ms inference, 2.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:14:00.052 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 1, mfr time: 1.54
2024-08-06 00:14:27.619 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 27.57

0: 1888x1408 4 embeddings, 5760.3ms
Speed: 29.9ms preprocess, 5760.3ms inference, 1.4ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:14:38.158 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 4, mfr time: 4.72
2024-08-06 00:15:04.580 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 26.42

0: 1888x1408 1 embedding, 3774.9ms
Speed: 25.6ms preprocess, 3774.9ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:15:09.153 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 1, mfr time: 0.77
2024-08-06 00:15:24.270 | INFO | magic_pdf.model.pdf_extract_kit:call:143 - layout detection cost: 15.12

0: 1888x1408 (no detections), 5563.2ms
Speed: 26.3ms preprocess, 5563.2ms inference, 1.3ms postprocess per image at shape (1, 3, 1888, 1408)
2024-08-06 00:15:29.863 | INFO | magic_pdf.model.pdf_extract_kit:call:173 - formula nums: 0, mfr time: 0.0
2024-08-06 00:15:29.865 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:118 - doc analyze cost: 461.83641028404236
2024-08-06 00:15:33.156 | INFO | magic_pdf.pipe.UNIPipe:pipe_mk_markdown:48 - uni_pipe mk mm_markdown finished

(MinerU) C:\Users\rgthx\Downloads\MinerU-master\MinerU-master\demo>magic-pdf --help
'magic-pdf' is not recognized as an internal or external command, operable program or batch file.

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cpu

myhloli commented 3 months ago

Can you run pip list and check whether the magic-pdf package is installed?
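The check suggested above can also be done from Python, which additionally shows where the package was installed and which commands it is supposed to provide. This is only a minimal sketch using the standard-library importlib.metadata module; the helper name describe_distribution is made up for illustration and is not part of magic-pdf.

```python
from importlib import metadata


def describe_distribution(dist_name):
    """Return version, install location and console scripts of a distribution.

    Returns None when the interpreter running this code cannot see the package.
    `describe_distribution` is a hypothetical helper name, not a magic-pdf API.
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # Console scripts are the commands pip creates in a Scripts/bin folder.
    scripts = [ep.name for ep in dist.entry_points if ep.group == "console_scripts"]
    return {
        "version": dist.version,
        # Folder that holds the installed package (a site-packages directory).
        "location": str(dist.locate_file("")),
        "console_scripts": scripts,
    }


if __name__ == "__main__":
    print(describe_distribution("magic-pdf"))
```

Run it with the same python that runs demo.py. If the reported location lies outside the activated conda environment, the command was probably generated in a scripts folder that the environment does not put on PATH.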

RGthx commented 3 months ago

The output of pip list is shown below. The magic-pdf package is in the list (otherwise demo.py would not run either).

C:\Users\rgthx>conda activate MinerU

(MinerU) C:\Users\rgthx>pip list
Package Version
absl-py 2.1.0
aiohappyeyeballs 2.3.4
aiohttp 3.10.1
aiosignal 1.3.1
albucore 0.0.13
albumentations 1.4.12
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.4.0
astor 0.8.1
async-timeout 4.0.3
attrdict 2.0.1
attrs 24.1.0
Babel 2.15.0
bce-python-sdk 0.9.19
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
boto3 1.34.153
botocore 1.34.153
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.4.0
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorama 0.4.6
colorlog 6.8.2
contourpy 1.2.1
cryptography 43.0.0
cssselect 1.2.0
cssutils 2.11.1
cycler 0.12.1
Cython 3.0.11
datasets 2.20.0
decorator 5.1.1
detectron2 0.6
dill 0.3.8
et-xmlfile 1.1.0
eva-decord 0.6.1
eval_type_backport 0.2.0
evaluate 0.4.2
exceptiongroup 1.2.2
fairscale 0.4.13
fast-langdetect 0.2.0
fasttext-wheel 0.9.2
filelock 3.15.4
fire 0.6.0
Flask 3.0.3
flask-babel 4.0.0
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.5.0
ftfy 6.2.0
future 1.0.0
fvcore 0.1.5.post20221221
grpcio 1.65.4
h11 0.14.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.24.5
hydra-core 1.3.2
idna 3.7
imageio 2.34.2
imgaug 0.4.0
intel-openmp 2021.4.0
iopath 0.1.9
itsdangerous 2.2.0
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
kiwisolver 1.4.5
langdetect 1.0.9
lazy_loader 0.4
lmdb 1.5.1
loguru 0.7.2
lxml 5.2.2
magic-pdf 0.6.2b1
Markdown 3.6
MarkupSafe 2.1.5
matplotlib 3.9.0
mkl 2021.4.0
more-itertools 10.3.0
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mypy-extensions 1.0.0
networkx 3.3
numpy 1.26.4
omegaconf 2.3.0
opencv-contrib-python 4.6.0.66
opencv-python 4.6.0.66
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opt-einsum 3.3.0
packaging 24.1
paddleocr 2.7.3
paddlepaddle 2.6.1
pandas 2.2.2
pathspec 0.12.1
pdf2docx 0.5.8
pdfminer.six 20231228
pillow 10.4.0
pip 24.0
platformdirs 4.2.2
portalocker 2.10.1
premailer 3.10.0
protobuf 3.20.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pybind11 2.13.1
pyclipper 1.3.0.post5
pycocotools 2.0.8
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
PyMuPDF 1.24.9
PyMuPDFb 1.24.9
pyparsing 3.1.2
python-dateutil 2.9.0.post0
python-docx 1.1.2
pytz 2024.1
pywin32 306
PyYAML 6.0.1
rapidfuzz 3.9.5
rarfile 4.2
regex 2024.7.24
requests 2.32.3
robust-downloader 0.0.2
s3transfer 0.10.2
safetensors 0.4.4
scikit-image 0.24.0
scikit-learn 1.5.1
scipy 1.14.0
seaborn 0.13.2
setuptools 72.1.0
shapely 2.0.5
six 1.16.0
sniffio 1.3.1
soupsieve 2.5
sympy 1.13.1
tabulate 0.9.0
tbb 2021.13.0
tensorboard 2.17.0
tensorboard-data-server 0.7.2
termcolor 2.4.0
threadpoolctl 3.5.0
tifffile 2024.7.24
timm 0.9.16
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.1
torchtext 0.18.0
torchvision 0.18.1
tqdm 4.66.5
transformers 4.40.0
typing_extensions 4.12.2
tzdata 2024.1
ultralytics 8.2.73
ultralytics-thop 2.0.0
unimernet 0.1.6
urllib3 2.2.2
visualdl 2.5.3
Wand 0.6.13
wcwidth 0.2.13
webdataset 0.2.86
Werkzeug 3.0.3
wheel 0.43.0
win32-setctime 1.1.0
wordninja 2.0.0
xxhash 3.4.1
yacs 0.1.8
yarl 1.9.4

(MinerU) C:\Users\rgthx>magic-pdf -v
'magic-pdf' is not recognized as an internal or external command, operable program or batch file.

(MinerU) C:\Users\rgthx>

RGthx commented 3 months ago

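Sorry, this is solved. The fix was to activate the virtual environment and set everything up inside an Anaconda Prompt running with administrator privileges. Previously I had activated the environment directly in a regular terminal, and pip installed the packages to the C: drive by default. Reference: https://blog.csdn.net/m0_65634471/article/details/130297467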
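For anyone who hits the same symptom: the log in the report shows packages being loaded from C:\Users\rgthx\AppData\Roaming\Python\Python310\site-packages, which is consistent with a per-user pip install whose scripts folder is not on PATH. The sketch below is one way to check that, under the assumption that the command is an ordinary pip console script; the helper name locate_console_script is made up for illustration, and only standard-library calls are used.

```python
import os
import shutil
import sysconfig
from pathlib import Path


def locate_console_script(name):
    """Report where a pip-installed command could live and whether PATH covers it.

    `locate_console_script` is a hypothetical helper, not part of magic-pdf.
    """
    # Scripts folder of the interpreter/environment running this code.
    env_dir = Path(sysconfig.get_path("scripts"))
    # Scripts folder pip uses for per-user installs.
    user_scheme = "nt_user" if os.name == "nt" else "posix_user"
    user_dir = Path(sysconfig.get_path("scripts", user_scheme))

    # On Windows, pip creates an .exe launcher for each console script.
    exe_name = name + ".exe" if os.name == "nt" else name
    path_dirs = {
        os.path.normcase(os.path.normpath(p))
        for p in os.environ.get("PATH", "").split(os.pathsep)
        if p
    }

    locations = []
    for kind, folder in (("environment", env_dir), ("user", user_dir)):
        normalized = os.path.normcase(os.path.normpath(str(folder)))
        locations.append({
            "kind": kind,
            "folder": str(folder),
            "script_exists": (folder / exe_name).is_file(),
            "folder_on_path": normalized in path_dirs,
        })
    # shutil.which returns None when the shell could not find the command.
    return {"resolved": shutil.which(name), "locations": locations}


if __name__ == "__main__":
    report = locate_console_script("magic-pdf")
    print("resolved on PATH:", report["resolved"])
    for location in report["locations"]:
        print(location)
```

If the script exists in the "user" folder while that folder is not on PATH, either add the folder to PATH or reinstall the package inside the activated environment so the launcher ends up in the environment's own scripts folder.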