opendatalab / PDF-Extract-Kit

A Comprehensive Toolkit for High-Quality PDF Content Extraction
https://pdf-extract-kit.readthedocs.io/zh-cn/latest/index.html
GNU Affero General Public License v3.0
5.27k stars 357 forks

Reading order is wrong in the parsing results for two-column documents, and some content is missing. Can the reading order be improved? #96

Open Maple0709 opened 2 months ago

drunkpig commented 2 months ago

@Maple0709 Please provide your PDF so that I can look into this issue.

Maple0709 commented 2 months ago
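说明书 (instruction manual)

The file content is as shown in the image, but after parsing the reading order is: the section on children riding -> the section on pets riding -> the section on elderly passengers riding.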

Maple0709 commented 2 months ago

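The reading order is correct only when the left column has as much content as the right column, or when the right column has more. Otherwise the problem described above occurs.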

Joker1212 commented 2 months ago

I recently submitted a PR. Could someone take a look? I wrote a rule-based sorting function, using PaddleStructure as a reference. LayoutReader did not perform well in my experiments, so I didn't use it.
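
For readers following the thread, here is a minimal sketch of what a rule-based two-column sort can look like. It is not the code from the PR and not the PaddleStructure implementation. The function name `sort_two_column`, the `span_ratio` threshold, and the `(x0, y0, x1, y1)` box format with a top-left origin are all assumptions made for illustration. Each box is assigned to the left or right column by its horizontal center. Boxes wider than a set fraction of the page are treated as full-width and split the page into bands.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), origin at the top-left of the page.
Box = Tuple[float, float, float, float]


def sort_two_column(boxes: List[Box], page_width: float,
                    span_ratio: float = 0.6) -> List[Box]:
    """Order boxes as: left column top-to-bottom, then right column.

    Boxes wider than span_ratio * page_width are treated as full-width
    (for example a title) and split the page into horizontal bands.
    """
    mid = page_width / 2.0
    ordered: List[Box] = []
    left: List[Box] = []
    right: List[Box] = []

    def flush() -> None:
        # Emit the current band: whole left column first, then the right one.
        ordered.extend(sorted(left, key=lambda b: (b[1], b[0])))
        ordered.extend(sorted(right, key=lambda b: (b[1], b[0])))
        left.clear()
        right.clear()

    # Walk the boxes top-to-bottom so full-width boxes close the band above them.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        x0, _, x1, _ = box
        if (x1 - x0) > span_ratio * page_width:
            flush()
            ordered.append(box)
        elif (x0 + x1) / 2.0 < mid:
            left.append(box)
        else:
            right.append(box)
    flush()
    return ordered


# The case reported above: the left column holds more content than the right.
title = (50, 20, 550, 80)
left1, left2, left3 = (50, 100, 280, 200), (50, 220, 280, 320), (50, 340, 280, 440)
right1 = (320, 100, 550, 200)
result = sort_two_column([right1, left3, title, left1, left2], page_width=600)
```

Because the column is decided before the vertical sort, the order does not depend on which column is longer. A real layout would also need to handle more than two columns, skewed scans, and boxes that straddle the midline.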
