opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.3k stars 847 forks

Code blocks are not recognized #225

Open yiyibooks opened 1 month ago

yiyibooks commented 1 month ago

Description of the bug

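Papers often contain code. MinerU currently recognizes these code blocks as plain text and collapses them onto a single line, as in the image below.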

image

It is recognized as:

As of OpenDevin v0.6, we support the following list of skills. Please refer to the source code for the most up-to-date list of skills: https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/ runtime/plugins/agent_skills/agentskills.py

def open_file (path: str, line_number: Optional[int] $=$ None ) $->$ None : """ Opens the file at the given path in the editor. If line_number is $\hookrightarrow$ provided, the window will be moved to include that line. → Args: path: str: The path to the file to open. line_number: Optional[int]: The line number to move to. """ pass
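For comparison, here is a sketch of what the output for this snippet might look like if the code block were detected and emitted as a fenced block. The indentation and line breaks are reconstructed from the flattened text above and are an assumption about the original listing; the `typing` import is added only so the snippet runs on its own.

```python
from typing import Optional  # added so the snippet is self-contained


def open_file(path: str, line_number: Optional[int] = None) -> None:
    """
    Opens the file at the given path in the editor. If line_number is
    provided, the window will be moved to include that line.

    Args:
        path: str: The path to the file to open.
        line_number: Optional[int]: The line number to move to.
    """
    pass
```

Note that the `$=$`, `$->$`, and `$\hookrightarrow$` fragments in the actual output correspond to plain `=`, `->`, and a line-wrap marker in the listing, so they appear to have been treated as inline math.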

How to reproduce the bug

Sample paper: https://arxiv.org/pdf/2407.16741

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

drunkpig commented 1 month ago

@yiyibooks Thanks for your enthusiasm. As you can see, the layout recognition model does not yet recognize code blocks, lists, or content lists. Support for these is in our plans.