opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
https://opendatalab.com/OpenSourceTools?tool=extract
GNU Affero General Public License v3.0
17.96k stars 1.29k forks

There are reading order problems in this published version #1038

Closed zahrarsl closed 4 hours ago

zahrarsl commented 1 day ago

Description of the bug

I converted a number of PDF files to Markdown with this tool, and every output file contains errors: the reading order of the text is not preserved, as shown in the screenshots below.

Screenshot 2024-11-20 130720 Screenshot 2024-11-20 130706

How to reproduce the bug

I expected the text in the Markdown file to appear in exactly the same order as in the PDF file.

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

myhloli commented 1 day ago

Can you upload this PDF file? We will fix it as soon as possible.

zahrarsl commented 1 day ago

10.1016@j.is.2016.03.004.pdf I am attaching this file as an example. I have seen the same problem in several places. Is it caused by the new version? How accurate is the model?

myhloli commented 1 day ago

We have noticed issues with reading order and character loss in areas dense with formulas in text-based PDFs. We will thoroughly fix this problem in the next version.

zahrarsl commented 1 day ago

Thank you very much.

myhloli commented 1 day ago

[image] We are pleased that our new code performed well on this sample. In the coming days, we will release a new version to thoroughly address this issue.

zahrarsl commented 1 day ago

You are really great.

zahrarsl commented 1 day ago

10.1016@j.aci.2014.05.001_origin.pdf Sorry, one more thing. This file contains a series of algorithm sections; some of them are recognized as images, while others are not and are read as text. I suggest you consider this in the new version. Thanks.

myhloli commented 4 hours ago

fixed