opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports extraction from PDFs, webpages, and multi-format e-books.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.54k stars 865 forks

Can parsing of PDF documents with a three-column layout be supported? #615

Open guoguo0646 opened 1 week ago

guoguo0646 commented 1 week ago

With the current version (0.8.1), parsing a PDF document that uses a three-column layout produces paragraphs in the wrong order. [image]

Partial run log:

2024-09-14 10:20:35.811 | INFO | magic_pdf.model.pdf_extract_kit:__call__:289 - formula nums: 2, mfr time: 0.33
2024-09-14 10:20:36.460 | INFO | magic_pdf.model.pdf_extract_kit:__call__:372 - ocr cost: 0.65
2024-09-14 10:20:36.461 | INFO | magic_pdf.model.pdf_extract_kit:__call__:407 - table cost: 0.0
2024-09-14 10:20:36.461 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:136 - doc analyze cost: 5.489260673522949
2024-09-14 10:20:36.909 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 0, last_page_cost_time: 0.0
2024-09-14 10:20:36.970 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:179 - skip this page, page_id: 0, reason: complicated_layout
2024-09-14 10:20:36.971 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 1, last_page_cost_time: 0.06
2024-09-14 10:20:37.030 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:179 - skip this page, page_id: 1, reason: complicated_layout
2024-09-14 10:20:37.031 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 2, last_page_cost_time: 0.06
2024-09-14 10:20:37.051 | WARNING | magic_pdf.pdf_parse_union_core:parse_page_core:186 - skip this page, page_id: 2, reason: too_many_layout_columns
2024-09-14 10:20:37.061 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:145 - list found, list line ranges: [(12, 16)], [[12]]
2024-09-14 10:20:37.061 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:158 - lines 12 to 16 are a list
2024-09-14 10:20:37.075 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:145 - list found, list line ranges: [(0, 1)], [[0]]
2024-09-14 10:20:37.075 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:158 - lines 0 to 1 are a list
2024-09-14 10:20:37.076 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:145 - list found, list line ranges: [(16, 31)], [[16, 19, 22, 26, 29]]
2024-09-14 10:20:37.076 | INFO | magic_pdf.para.para_split_v2:__detect_list_lines:158 - lines 16 to 31 are a list
2024-09-14 10:20:37.076 | INFO | magic_pdf.para.para_split_v2:para_split:766 - joined the paragraphs of page 1 and page 2
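For anyone checking a longer log for the same symptom, the skipped pages and their reasons can be pulled out with a short script. This is only a minimal sketch: it assumes the warning text keeps the form seen in the log above (`skip this page, page_id: N, reason: R`), and the helper name `skipped_pages` is hypothetical, not part of magic_pdf.

```python
import re
from typing import Dict

# Pattern for the WARNING message format seen in the log above (an assumption
# about the message text, not an API of magic_pdf).
SKIP_RE = re.compile(r"skip this page, page_id: (\d+), reason: (\w+)")


def skipped_pages(log_text: str) -> Dict[int, str]:
    """Map each skipped page_id to the reason reported in the log text.

    Works on the whole log as one string, so it does not matter whether
    the entries are on separate lines or run together.
    """
    return {int(page): reason for page, reason in SKIP_RE.findall(log_text)}


# Two entries copied from the log above.
sample = (
    "2024-09-14 10:20:36.970 | WARNING | "
    "magic_pdf.pdf_parse_union_core:parse_page_core:179 - "
    "skip this page, page_id: 0, reason: complicated_layout "
    "2024-09-14 10:20:37.051 | WARNING | "
    "magic_pdf.pdf_parse_union_core:parse_page_core:186 - "
    "skip this page, page_id: 2, reason: too_many_layout_columns"
)

print(skipped_pages(sample))
```

In the log above, pages 0 and 1 are skipped as `complicated_layout` and page 2 as `too_many_layout_columns`.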

myhloli commented 1 week ago

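We are currently working on optimizations to layout ordering. The next major release is expected to order layouts with more than two columns correctly.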