opendatalab / PDF-Extract-Kit

A Comprehensive Toolkit for High-Quality PDF Content Extraction
https://pdf-extract-kit.readthedocs.io/zh-cn/latest/index.html
GNU Affero General Public License v3.0
5.27k stars, 357 forks

An optimization idea #55

Open Joker1212 opened 3 months ago

Joker1212 commented 3 months ago

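Tables and paragraphs in a PDF can span a page break. Could you consider running OCR on two pages at a time, sliding the window forward one page per step, and deduplicating the results at the end, to improve the accuracy of layout detection and table extraction?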
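To make the proposal concrete, here is a minimal sketch of the sliding two-page window and the final deduplication step. It is not PDF-Extract-Kit code: `page_windows`, `iou`, `dedupe`, and the 0.8 overlap threshold are all hypothetical, and detections are assumed to have already been mapped back to per-page coordinates as `(page_index, (x0, y0, x1, y1))`.

```python
from typing import Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def page_windows(num_pages: int, size: int = 2) -> Iterator[Tuple[int, ...]]:
    """Yield overlapping windows of page indices, advancing one page per step."""
    if num_pages <= 0:
        return
    if num_pages < size:
        # Document shorter than one window: process it as a single window.
        yield tuple(range(num_pages))
        return
    for start in range(num_pages - size + 1):
        yield tuple(range(start, start + size))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dedupe(dets: List[Tuple[int, Box]], thresh: float = 0.8) -> List[Tuple[int, Box]]:
    """Drop detections that repeat an already kept box on the same page.

    Every interior page is seen in two windows, so most of its boxes
    are reported twice; the first report wins (an arbitrary choice).
    """
    kept: List[Tuple[int, Box]] = []
    for page, box in dets:
        if not any(p == page and iou(box, b) >= thresh for p, b in kept):
            kept.append((page, box))
    return kept


print(list(page_windows(4)))  # [(0, 1), (1, 2), (2, 3)]
```

Keeping the first report is the simplest policy. A real pipeline might instead prefer the detection from whichever window shows the whole element.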

myhloli commented 3 months ago

It's a nice idea, but this way headers and footers are harder to detect.

Joker1212 commented 3 months ago

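Is that because the footer of one page and the header of the next end up adjacent and interfere with the merge? It might work to first find the bounding coordinates of the text on each page and then merge. Finding those bounds doesn't need a particularly powerful layout or OCR model; it only needs to be fast.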
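As a rough illustration of "finding the text bounds", the sketch below takes whatever boxes a fast, lightweight detector returns for one page and reduces them to a single enclosing rectangle. The function name `text_bounds` and the box format are assumptions made for this example, not part of the toolkit.

```python
from typing import Iterable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def text_bounds(boxes: Iterable[Box]) -> Optional[Box]:
    """Return the smallest rectangle enclosing all text boxes on a page.

    Returns None for a page with no detected text.
    """
    boxes = list(boxes)
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


print(text_bounds([(10, 20, 30, 40), (5, 50, 25, 60)]))  # (5, 20, 30, 60)
```

Each page could then be cropped to its bounds before two pages are stitched together. Whether headers and footers should be excluded before computing the bounds is left open here.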

xsank commented 2 months ago

> It's a nice idea, but this way headers and footers are harder to detect.

The height of each page is available, so they can actually still be detected, although it's a bit tricky.
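One possible reading of this, sketched under stated assumptions: if two pages are stacked vertically into one image, the known page heights let you map any y-coordinate back to its page and then check whether a box sits in that page's top or bottom margin. The names `locate` and `margin_zone` and the 8% margin ratio are hypothetical, and a box that straddles the page seam is assigned to the page containing its top edge.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def locate(y: float, page_heights: List[float]) -> Tuple[int, float]:
    """Map a y-coordinate in the stacked image to (page index, y within that page)."""
    offsets = [0.0, *accumulate(page_heights)]  # offsets[i] is the top of page i
    idx = bisect_right(offsets, y) - 1
    idx = max(0, min(idx, len(page_heights) - 1))  # clamp to a valid page
    return idx, y - offsets[idx]


def margin_zone(y0: float, y1: float, page_heights: List[float],
                margin_ratio: float = 0.08) -> str:
    """Classify a box spanning [y0, y1] as 'header', 'footer', or 'body'."""
    page, top = locate(y0, page_heights)
    bottom = top + (y1 - y0)
    height = page_heights[page]
    if bottom <= height * margin_ratio:
        return "header"
    if top >= height * (1 - margin_ratio):
        return "footer"
    return "body"


heights = [1000.0, 1000.0]
print(margin_zone(960, 990, heights))    # footer (bottom of first page)
print(margin_zone(1010, 1040, heights))  # header (top of second page)
```

A position test like this only yields candidates. Confirming that a candidate really is a header or footer, for example by checking that it repeats across pages, would be a separate step.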

Joker1212 commented 2 months ago

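I tried this myself and found that it really does make the downstream segmentation much better, because headings are the best markers for structural segmentation. I sort the blocks first, then merge across columns, then merge across pages. Even when unrelated passages end up stitched together, they are still recognized as two separate paragraphs, so this does not degrade layout detection.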
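The order described above (sort, merge across columns, merge across pages) might look roughly like the following. This is a guess at the shape of the approach, not the commenter's actual code: the `Block` fields, the `merge_blocks` name, and the rule "join two text blocks only when the first does not end in sentence-final punctuation" are all assumptions. Headings always start a new block, which is what keeps unrelated passages apart.

```python
from dataclasses import dataclass
from typing import List

# Assumed end-of-sentence marks (ASCII and full-width).
TERMINATORS = (".", "!", "?", "。", "!", "?")


@dataclass
class Block:
    page: int
    column: int
    y: float
    kind: str  # "title" or "text"
    text: str


def merge_blocks(blocks: List[Block]) -> List[Block]:
    """Sort blocks into reading order, then join text that continues
    across a column or page break. Titles are never merged."""
    ordered = sorted(blocks, key=lambda b: (b.page, b.column, b.y))
    merged: List[Block] = []
    for b in ordered:
        prev = merged[-1] if merged else None
        unfinished = (
            prev is not None
            and prev.kind == "text"
            and b.kind == "text"
            and not prev.text.rstrip().endswith(TERMINATORS)
        )
        if unfinished:
            # Joining with a space suits English; CJK text would join without one.
            joined = prev.text.rstrip() + " " + b.text.lstrip()
            merged[-1] = Block(prev.page, prev.column, prev.y, "text", joined)
        else:
            merged.append(Block(b.page, b.column, b.y, b.kind, b.text))
    return merged


blocks = [
    Block(1, 0, 10, "text", "two pages."),
    Block(0, 0, 10, "title", "Intro"),
    Block(1, 0, 60, "title", "Methods"),
    Block(0, 0, 50, "text", "The table spans"),
]
print([b.text for b in merge_blocks(blocks)])
# ['Intro', 'The table spans two pages.', 'Methods']
```

Because the same rule handles column breaks and page breaks, a single sorted pass covers both merges in this sketch.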
