opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://mineru.readthedocs.io/
GNU Affero General Public License v3.0
13.9k stars 1.04k forks

Can documents in ppt format be parsed? #297

Open chuanbei888 opened 3 months ago

myhloli commented 3 months ago

https://github.com/opendatalab/magic-doc

It works with ppt/pptx files.

drunkpig commented 3 months ago

If you want high-quality extraction results, convert the ppt to pdf and then use MinerU. If you want fast extraction and care less about quality, choose magic-doc (https://github.com/opendatalab/magic-doc).

drunkpig commented 3 months ago

@chuanbei888 Try converting the ppt to pdf with LibreOffice.

drunkpig commented 3 months ago

libreoffice --invisible --convert-to docx:'MS Word 2007 XML' /path/to/mydoc.doc --outdir /output/dir
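The command above converts a .doc file to .docx. For the ppt-to-pdf case discussed in this thread, a minimal sketch might look like the following. It assumes LibreOffice is installed and on the PATH; the function name `ppt_to_pdf`, the `SOFFICE` override variable, and the paths are hypothetical and not part of LibreOffice itself.

```shell
# Sketch: convert one presentation to PDF with LibreOffice in headless mode.
# SOFFICE is a hypothetical override for the binary name, since some installs
# expose it as "soffice" rather than "libreoffice".
ppt_to_pdf() {
    input=$1
    outdir=$2
    # Make sure the output directory exists before converting.
    mkdir -p "$outdir" || return 1
    "${SOFFICE:-libreoffice}" --headless --convert-to pdf --outdir "$outdir" "$input"
}

# Example call (placeholder paths):
# ppt_to_pdf /path/to/slides.pptx /output/dir
```

The resulting PDF should land in the output directory under the same base name as the input, and can then be passed to MinerU.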

chuanbei888 commented 3 months ago

Okay, I'll give it a try.

thorory commented 3 months ago

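A question about converting ppt and docx to markdown: which approach gives better results, converting to pdf and then using magic-pdf, or using magic-doc directly?

If I convert to pdf first and then to md, could some of the text be recognized less accurately than if it were read directly from the source file?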

myhloli commented 3 months ago

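magic-doc has strong text extraction and is faster, but its final output does not include any images. Converting to pdf and then extracting with magic-pdf gives better image and layout results; the drawback is that it is slower.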

zouhuigang commented 3 months ago

Is there a tool for batch-converting docx to pdf?

drunkpig commented 3 months ago

@zouhuigang LibreOffice
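As a rough illustration of batch conversion, LibreOffice accepts several input files in a single `--convert-to` run. The sketch below assumes LibreOffice is on the PATH; the function name `batch_docx_to_pdf`, the `SOFFICE` override variable, and the directory paths are hypothetical.

```shell
# Sketch: convert every .docx file in a directory to PDF in one LibreOffice run.
# SOFFICE is a hypothetical override for the binary name (e.g. "soffice").
batch_docx_to_pdf() {
    srcdir=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # Replace the function's positional parameters with the matching files.
    set -- "$srcdir"/*.docx
    # If the glob matched nothing, "$1" is the unexpanded pattern and does not exist.
    [ -e "$1" ] || { echo "no .docx files in $srcdir" >&2; return 1; }
    "${SOFFICE:-libreoffice}" --headless --convert-to pdf --outdir "$outdir" "$@"
}

# Example call (placeholder paths):
# batch_docx_to_pdf /path/to/docs /output/dir
```

Passing all files to one invocation avoids starting LibreOffice once per document, which tends to matter for large batches.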

Victor94-king commented 2 weeks ago

Is there any tool you would recommend for converting ppt to pdf?