Could you provide complete usage instructions?

opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.

https://opendatalab.com/OpenSourceTools

GNU Affero General Public License v3.0

Could you provide complete usage instructions? #157

Open Pandas886 opened 3 months ago

Pandas886 commented 3 months ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like. A clear and concise description of what you want to happen.

Describe alternatives you've considered. A clear and concise description of any alternative solutions or features you've considered.

Additional context. Add any other context or screenshots about the feature request here.

myhloli commented 3 months ago

Could you describe in detail what the complete instructions you are asking for should cover?

congweitao commented 3 months ago

After running pip install magic-pdf[full-cpu], I still cannot import the magic_pdf library. Could you provide complete usage instructions?

myhloli commented 3 months ago

If the library cannot be used, please open a new issue and describe the problem in detail following the template. Thank you.

shizidushu commented 3 months ago

@congweitao @Pandas886 I ran the demo and wrote up my notes: https://zhuanlan.zhihu.com/p/709402502 It may be of some help.

LKAMING97 commented 3 months ago

Is deployment as a service supported?

lori-kuo commented 2 months ago

What is demo1.json for?

myhloli commented 2 months ago

It is the intermediate data produced after the model has analyzed the PDF.

BAMMBoo commented 2 months ago

When I run it from the command line, it reports "not found json". What should I do?

2024-07-26 15:08:51.986 | WARNING | magic_pdf.cli.magicpdf:get_model_json:310 - not found json D:/Documents/动态系统建模_状态空间方程.json existed

myhloli commented 2 months ago

This is only a warning. If the JSON file does not exist, the program generates one itself.

BAMMBoo commented 2 months ago

The warning is followed by two ValueErrors:

ValueError: predict processes one line at a time (remove '\n')

ValueError: Unable to avoid copy while creating an array as requested.

The run ends with an empty output folder.

myhloli commented 2 months ago

I have not seen this error before. Could you paste the full log?

BAMMBoo commented 2 months ago

(MinerU) PS C:\Users\qweyu> magic-pdf pdf-command --pdf "D:/Documents/123.pdf" --inside_model true

2024-07-26 17:55:10.101 | WARNING | magic_pdf.cli.magicpdf:get_model_json:310 - not found json D:/Documents/123.json existed
2024-07-26 17:55:10.103 | INFO | magic_pdf.cli.magicpdf:do_parse:91 - local output dir is D:/Documents/test\magic-pdf\123\auto

Traceback (most recent call last):
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\libs\language.py", line 9, in detect_lang
    lang_upper = detect_language(text)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\fast_langdetect\ft_detect\__init__.py", line 23, in detect_language
    lang_code = detect(sentence, low_memory=low_memory).get("lang").upper()
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\fast_langdetect\ft_detect\infer.py", line 81, in detect
    labels, scores = model.predict(text)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\fasttext\FastText.py", line 221, in predict
    text = check(text)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\fasttext\FastText.py", line 208, in check
    raise ValueError(
ValueError: predict processes one line at a time (remove '\n')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\DailySW\Anaconda3\envs\MinerU\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 325, in pdf_command
    do_parse(
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 106, in do_parse
    pipe.pipe_classify()
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\pipe\AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\filter\pdf_meta_scan.py", line 337, in pdf_meta_scan
    text_language = get_language(doc)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\filter\pdf_meta_scan.py", line 289, in get_language
    page_language = detect_lang(text_block)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\libs\language.py", line 12, in detect_lang
    lang_upper = detect_language(html_no_ctrl_chars)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\fast_langdetect\ft_detect\__init__.py", line 23, in detect_language
    lang_code = detect(sentence, low_memory=low_memory).get("lang").upper()
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\fast_langdetect\ft_detect\infer.py", line 81, in detect
    labels, scores = model.predict(text)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\fasttext\FastText.py", line 228, in predict
    return labels, np.array(probs, copy=False)
ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
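
The first ValueError in this log is raised by fastText's `predict`, which rejects input that contains a newline. A minimal sketch of collapsing text onto one line before language detection is shown below. `to_single_line` is a hypothetical helper written for illustration; it is not part of magic_pdf or fast_langdetect, and the traceback suggests magic_pdf already retries with a cleaned string on its own.

```python
def to_single_line(text: str) -> str:
    """Collapse every whitespace run, including newlines, into one space."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(text.split())


print(to_single_line("first line\nsecond line"))
```

Note that this only addresses the first exception; the run above actually stops on the second one, which concerns NumPy.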

myhloli commented 2 months ago

NumPy 2.0 is not supported; switch to version 1.26.4.
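
Before rerunning, it may help to confirm which NumPy release is installed in the active environment. The sketch below uses only the standard library and does not import NumPy itself. The helper names `major_version` and `numpy_status` are made up for this example, and the suggested pin simply repeats the version named in the reply above.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


def numpy_status() -> str:
    """Describe the installed NumPy relative to the 2.0 boundary."""
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        return "numpy is not installed in this environment"
    if major_version(installed) >= 2:
        return f"numpy {installed} found: 2.x, consider pinning numpy==1.26.4"
    return f"numpy {installed} found: 1.x"


print(numpy_status())
```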

sxk000 commented 2 months ago

Hello, I get the same error with numpy 1.26.4:

Traceback (most recent call last):
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/libs/language.py", line 9, in detect_lang
    lang_upper = detect_language(text)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/fast_langdetect/ft_detect/__init__.py", line 23, in detect_language
    lang_code = detect(sentence, low_memory=low_memory).get("lang").upper()
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/fast_langdetect/ft_detect/infer.py", line 81, in detect
    labels, scores = model.predict(text)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/fasttext/FastText.py", line 225, in predict
    text = check(text)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/fasttext/FastText.py", line 212, in check
    raise ValueError(
ValueError: predict processes one line at a time (remove '\n')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apply/anaconda3/envs/p310pdf/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 325, in pdf_command
    do_parse(
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 106, in do_parse
    pipe.pipe_classify()
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 25, in pipe_classify
    self.pdf_type = AbsPipe.classify(self.pdf_bytes)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/pipe/AbsPipe.py", line 63, in classify
    pdf_meta = pdf_meta_scan(pdf_bytes)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 337, in pdf_meta_scan
    text_language = get_language(doc)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/filter/pdf_meta_scan.py", line 289, in get_language
    page_language = detect_lang(text_block)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/magic_pdf/libs/language.py", line 12, in detect_lang
    lang_upper = detect_language(html_no_ctrl_chars)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/fast_langdetect/ft_detect/__init__.py", line 23, in detect_language
    lang_code = detect(sentence, low_memory=low_memory).get("lang").upper()
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/fast_langdetect/ft_detect/infer.py", line 81, in detect
    labels, scores = model.predict(text)
  File "/apply/anaconda3/envs/p310pdf/lib/python3.10/site-packages/fasttext/FastText.py", line 232, in predict
    return labels, np.array(probs, copy=False)
ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

The corresponding environment is as follows:


Package                   Version
------------------------- ------------------
absl-py                   2.1.0
aiohttp                   3.9.5
aiosignal                 1.3.1
albucore                  0.0.12
albumentations            1.4.12
altair                    5.3.0
annotated-types           0.7.0
antlr4-python3-runtime    4.9.3
anyio                     4.4.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
astor                     0.8.1
asttokens                 2.4.1
async-lru                 2.0.4
async-timeout             4.0.3
attrdict                  2.0.1
attrs                     23.2.0
Babel                     2.15.0
bce-python-sdk            0.9.17
beautifulsoup4            4.12.3
black                     24.4.2
bleach                    6.1.0
blinker                   1.8.2
boto3                     1.34.149
botocore                  1.34.149
braceexpand               0.1.7
Brotli                    1.1.0
cachetools                5.4.0
certifi                   2024.7.4
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
cloudpickle               3.0.0
colorlog                  6.8.2
comm                      0.2.2
contourpy                 1.2.1
cryptography              43.0.0
cssselect                 1.2.0
cssutils                  2.11.1
cycler                    0.12.1
Cython                    3.0.10
datasets                  2.20.0
debugpy                   1.8.2
decorator                 5.1.1
defusedxml                0.7.1
detectron2                0.6
dill                      0.3.8
et-xmlfile                1.1.0
eva-decord                0.6.1
eval_type_backport        0.2.0
evaluate                  0.4.2
exceptiongroup            1.2.2
executing                 2.0.1
fairscale                 0.4.13
fast-langdetect           0.2.1
fastjsonschema            2.20.0
fasttext-wheel            0.9.2
filelock                  3.15.4
fire                      0.6.0
Flask                     3.0.3
flask-babel               4.0.0
fonttools                 4.53.1
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2024.5.0
ftfy                      6.2.0
future                    1.0.0
fvcore                    0.1.5.post20221221
gitdb                     4.0.11
GitPython                 3.1.43
grpcio                    1.65.1
h11                       0.14.0
h2                        4.1.0
hpack                     4.0.0
httpcore                  1.0.5
httpx                     0.27.0
huggingface-hub           0.24.2
hydra-core                1.3.2
hyperframe                6.0.1
idna                      3.7
imageio                   2.34.2
imgaug                    0.4.0
iopath                    0.1.9
ipykernel                 6.29.5
ipython                   8.26.0
isoduration               20.11.0
itsdangerous              2.2.0
jedi                      0.19.1
Jinja2                    3.1.4
jmespath                  1.0.1
joblib                    1.4.2
json5                     0.9.25
jsonpointer               3.0.0
jsonschema                4.23.0
jsonschema-specifications 2023.12.1
jupyter_client            8.6.2
jupyter_core              5.7.2
jupyter-events            0.10.0
jupyter-lsp               2.2.5
jupyter_server            2.14.2
jupyter_server_terminals  0.5.3
jupyterlab                4.2.4
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
kiwisolver                1.4.5
langdetect                1.0.9
lazy_loader               0.4
lmdb                      1.5.1
loguru                    0.7.2
lxml                      5.2.2
magic-pdf                 0.6.1
Markdown                  3.6
markdown-it-py            3.0.0
MarkupSafe                2.1.5
matplotlib                3.9.1
matplotlib-inline         0.1.7
mdurl                     0.1.2
mistune                   3.0.2
more-itertools            10.3.0
mpmath                    1.3.0
multidict                 6.0.5
multiprocess              0.70.16
mypy-extensions           1.0.0
nbclient                  0.10.0
nbconvert                 7.16.4
nbformat                  5.10.4
nest-asyncio              1.6.0
networkx                  3.3
nltk                      3.8.1
notebook_shim             0.2.4
numpy                     1.26.4
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.5.82
nvidia-nvtx-cu12          12.1.105
omegaconf                 2.3.0
opencv-contrib-python     4.6.0.66
opencv-python             4.6.0.66
opencv-python-headless    4.10.0.84
openpyxl                  3.1.5
opt-einsum                3.3.0
overrides                 7.7.0
packaging                 24.1
paddleocr                 2.7.3
paddlepaddle              2.6.1
pandas                    2.2.2
pandocfilters             1.5.1
parso                     0.8.4
pathspec                  0.12.1
pdf2docx                  0.5.8
pdf2image                 1.17.0
pdfminer.six              20231228
pexpect                   4.9.0
pillow                    10.4.0
pip                       24.0
platformdirs              4.2.2
portalocker               2.10.1
premailer                 3.10.0
prometheus_client         0.20.0
prompt_toolkit            3.0.47
protobuf                  4.25.3
psutil                    6.0.0
ptyprocess                0.7.0
pure_eval                 0.2.3
py-cpuinfo                9.0.0
pyarrow                   17.0.0
pyarrow-hotfix            0.6
pybind11                  2.13.1
pyclipper                 1.3.0.post5
pycocotools               2.0.8
pycparser                 2.22
pycryptodome              3.20.0
pydantic                  2.8.2
pydantic_core             2.20.1
pydeck                    0.9.1
Pygments                  2.18.0
PyMuPDF                   1.24.9
PyMuPDFb                  1.24.9
pyparsing                 3.1.2
pypdfium2                 4.30.0
python-dateutil           2.9.0.post0
python-docx               1.1.2
python-json-logger        2.0.7
pytz                      2024.1
PyYAML                    6.0.1
pyzmq                     26.0.3
rapidfuzz                 3.9.4
rarfile                   4.2
referencing               0.35.1
regex                     2024.7.24
requests                  2.32.3
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      13.7.1
robust-downloader         0.0.2
rpds-py                   0.19.1
s3transfer                0.10.2
safetensors               0.4.3
scikit-image              0.24.0
scikit-learn              1.5.1
scipy                     1.14.0
seaborn                   0.13.2
Send2Trash                1.8.3
setuptools                71.0.4
shapely                   2.0.5
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.1
soupsieve                 2.5
stack-data                0.6.3
streamlit                 1.37.0
streamlit-drawable-canvas 0.9.3
sympy                     1.13.1
tabulate                  0.9.0
tenacity                  8.5.0
tensorboard               2.17.0
tensorboard-data-server   0.7.2
termcolor                 2.4.0
terminado                 0.18.1
threadpoolctl             3.5.0
tifffile                  2024.7.24
timm                      0.9.16
tinycss2                  1.3.0
tokenizers                0.19.1
toml                      0.10.2
tomli                     2.0.1
toolz                     0.12.1
torch                     2.3.1
torchtext                 0.18.0
torchvision               0.18.1
tornado                   6.4.1
tqdm                      4.66.4
traitlets                 5.14.3
transformers              4.40.0
triton                    2.3.1
types-python-dateutil     2.9.0.20240316
typing_extensions         4.12.2
tzdata                    2024.1
ultralytics               8.2.68
ultralytics-thop          2.0.0
unimernet                 0.1.1
uri-template              1.3.0
urllib3                   2.2.2
visualdl                  2.5.3
Wand                      0.6.13
watchdog                  4.0.1
wcwidth                   0.2.13
webcolors                 24.6.0
webdataset                0.2.86
webencodings              0.5.1
websocket-client          1.8.0
Werkzeug                  3.0.3
wheel                     0.43.0
wordninja                 2.0.0
xxhash                    3.4.1
yacs                      0.1.8
yarl                      1.9.4
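
The "Unable to avoid copy" message is specific to NumPy 2.x, yet the list above reports numpy 1.26.4. One possible explanation, offered here only as a guess, is that the `magic-pdf` command runs under a different interpreter or picks up a different copy of the package than the one listed. A small diagnostic sketch using only the standard library follows; `locate` is a hypothetical helper name.

```python
import importlib.util
import sys
from typing import Optional


def locate(module_name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# Show which interpreter is running and where each package would load from
print("interpreter:", sys.executable)
for name in ("numpy", "fasttext", "magic_pdf"):
    print(f"{name}: {locate(name)}")
```

Running this with the same Python that backs the `magic-pdf` entry point should show whether the paths point into the expected conda environment.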

sxk000 commented 2 months ago

The fast-langdetect==0.2.0 specified here produces an error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
magic-pdf 0.6.1 requires fast-langdetect>=0.2.1, but you have fast-langdetect 0.2.0 which is incompatible.

What exactly should the installed environment versions be?

Would you mind sharing your requirements.txt file?

Thanks!

BAMMBoo commented 2 months ago

After changing numpy to version 1.26.4, I get an error saying the specified procedure could not be found.

magic-pdf pdf-command --pdf "D:/Documents/123.pdf" --inside_model true

2024-07-29 20:06:51.365 | WARNING | magic_pdf.cli.magicpdf:get_model_json:310 - not found json D:/Documents/123.json existed
2024-07-29 20:06:51.367 | INFO | magic_pdf.cli.magicpdf:do_parse:91 - local output dir is D:/Documents/test\magic-pdf\123\auto
2024-07-29 20:07:07.841 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 8512, cid_chars_radio: 0.0

Traceback (most recent call last):
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\DailySW\Anaconda3\envs\MinerU\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
    sys.exit(cli())
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 325, in pdf_command
    do_parse(
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\cli\magicpdf.py", line 111, in do_parse
    pipe.pipe_analyze()
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 29, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=False)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\doc_analyze_by_custom_model.py", line 65, in doc_analyze
    from magic_pdf.model.pdf_extract_kit import CustomPEKModel
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\magic_pdf\model\pdf_extract_kit.py", line 16, in <module>
    from unimernet.common.config import Config
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\unimernet\__init__.py", line 18, in <module>
    from unimernet.tasks import *
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\unimernet\tasks\__init__.py", line 10, in <module>
    from unimernet.tasks.unimernet_train import UniMERNet_Train
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\unimernet\tasks\unimernet_train.py", line 11, in <module>
    from torchtext.data import metrics
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\torchtext\__init__.py", line 18, in <module>
    from torchtext import _extension  # noqa: F401
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\torchtext\_extension.py", line 64, in <module>
    _init_extension()
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\torchtext\_extension.py", line 58, in _init_extension
    _load_lib("libtorchtext")
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\torchtext\_extension.py", line 50, in _load_lib
    torch.ops.load_library(path)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\site-packages\torch\_ops.py", line 1295, in load_library
    ctypes.CDLL(path)
  File "D:\DailySW\Anaconda3\envs\MinerU\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 127] 找不到指定的程序。 (The specified procedure could not be found.)

Pandas886 commented 2 months ago

Reposted from the write-up that @shizidushu shared above (https://zhuanlan.zhihu.com/p/709402502):

Create a Conda environment and install packages

conda create -n py310torch python=3.10
conda activate py310torch
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121

MinerU

Install dependencies

pip install magic-pdf[full-cpu]
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/

Download the weights and configure

Install dependencies
```
pip install -U "huggingface_hub[cli]"
```

Set the environment variable

Linux

export HF_ENDPOINT=https://hf-mirror.com

Windows

$env:HF_ENDPOINT = "https://hf-mirror.com"

Download the weights

huggingface-cli download wanderkid/PDF-Extract-Kit
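
The same download can also be driven from Python, which may be convenient inside a setup script. The sketch below assumes the `huggingface_hub` package installed in the earlier step; `configure_mirror` and `download_weights` are hypothetical wrapper names, while `snapshot_download` is the library's own function. The mirror URL and repository ID are the ones used in the commands above.

```python
import os


def configure_mirror(endpoint: str = "https://hf-mirror.com") -> None:
    """Point Hugging Face downloads at a mirror via HF_ENDPOINT."""
    os.environ["HF_ENDPOINT"] = endpoint


def download_weights(repo_id: str = "wanderkid/PDF-Extract-Kit") -> str:
    """Download the repository snapshot and return its local directory."""
    # Imported here so that HF_ENDPOINT is already set when the library loads
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling `configure_mirror()` and then `download_weights()` should return the local snapshot directory. The variable only takes effect if it is set before `huggingface_hub` is first imported in the process, which is why the import is deferred.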

Configure magic-pdf.json

Create a file named magic-pdf.json in your user home directory and fill it in as follows:

{
 "bucket_info":{
     "bucket-name-1":["ak", "sk", "endpoint"],
     "bucket-name-2":["ak", "sk", "endpoint"]
 },
 "temp-output-dir":"tmp",
 "models-dir":"D:/16-LLM-Cache/huggingface/hub/models--wanderkid--PDF-Extract-Kit/snapshots/bbfd601d3dab736bf366e2119ec0bbe0f4e6f012/models",
 "device-mode":"cuda"
}

Note: set device-mode to cuda or cpu according to your needs.
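
If you prefer to generate the file rather than edit it by hand, the following is a minimal sketch that builds the same structure and writes it as JSON. The keys and placeholder bucket entries are copied from the example above; `build_config` and `write_config` are hypothetical helper names, and the models directory you pass in is whatever path the weights ended up in on your machine.

```python
import json
from pathlib import Path


def build_config(models_dir: str, device_mode: str = "cpu") -> dict:
    """Return a magic-pdf.json structure matching the example above."""
    if device_mode not in ("cuda", "cpu"):
        raise ValueError("device_mode must be 'cuda' or 'cpu'")
    return {
        "bucket_info": {
            "bucket-name-1": ["ak", "sk", "endpoint"],
            "bucket-name-2": ["ak", "sk", "endpoint"],
        },
        "temp-output-dir": "tmp",
        "models-dir": models_dir,
        "device-mode": device_mode,
    }


def write_config(config: dict, path: Path) -> Path:
    """Write the configuration as UTF-8 JSON and return the path written."""
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    return path
```

For example, `write_config(build_config("/path/to/models", "cuda"), Path.home() / "magic-pdf.json")` would place the file in the home directory, overwriting any existing one.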

Usage

Command-line usage

View the help information
```
magic-pdf pdf-command --help
```

Run the command

magic-pdf pdf-command --pdf "assets/***手册.pdf" --inside_model true --model_mode full
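
To process several files in one go, the command above can be wrapped in a short Python script. This is only a sketch: it assumes the `magic-pdf` executable is on PATH, uses exactly the flags shown above, and the names `build_command` and `run_all` are invented for the example.

```python
import subprocess
from typing import List


def build_command(pdf_path: str, model_mode: str = "full") -> List[str]:
    """Build the argument list for one magic-pdf invocation."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", pdf_path,
        "--inside_model", "true",
        "--model_mode", model_mode,
    ]


def run_all(pdf_paths: List[str]) -> None:
    """Run the CLI once per PDF, stopping at the first failure."""
    for pdf_path in pdf_paths:
        # check=True raises CalledProcessError if the CLI exits non-zero
        subprocess.run(build_command(pdf_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces or non-ASCII characters.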

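Performance, strengths, and weaknesses

Speed: the command took 47 seconds to run
Strengths: text parsing is better than marker; text inside images in the PDF is recognized
Weaknesses: tables in the document are not recognized; some text is in the wrong order

Additional notes

Table recognition: others have already requested it; an update is expected within a month
Layout recognition: second-level headings may be misrecognized

Calling through the API (Python code)

Reference code: demo.py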

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
import magic_pdf.model as model_config

# Use the built-in models instead of a precomputed model JSON
model_config.__use_inside_model__ = True

# A forward slash avoids the invalid "\*" escape sequence in the path
with open("assets/***手册.pdf", "rb") as pdf_file:
    pdf_bytes = pdf_file.read()

# Directory where images extracted from the PDF are written
local_image_dir = "mineru_images"

image_writer = DiskReaderWriter(local_image_dir)

model_json = []  # left empty so that the built-in models do the parsing
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(local_image_dir, drop_mode="none")

Run time: 1 minute 12 seconds
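
The snippet above leaves the result in `md_content` without writing it anywhere. A minimal sketch for saving it next to the source PDF follows. `save_markdown` is a hypothetical helper, and it assumes `md_content` is either a string or a list of strings, which may differ between magic-pdf versions.

```python
from pathlib import Path
from typing import List, Union


def save_markdown(md_content: Union[str, List[str]], pdf_path: str) -> Path:
    """Write Markdown beside the PDF, replacing the .pdf suffix with .md."""
    # Join list output into one document; pass plain strings through unchanged
    text = md_content if isinstance(md_content, str) else "\n".join(md_content)
    out_path = Path(pdf_path).with_suffix(".md")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

For instance, `save_markdown(md_content, "assets/report.pdf")` would write `assets/report.md`; the file name here is only an example.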