how have you processed the blocks after finding out the layout order?

opendatalab / MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDFs to Markdown and JSON.

https://opendatalab.com/OpenSourceTools?tool=extract

GNU Affero General Public License v3.0

17.66k stars 1.28k forks

how have you processed the blocks after finding out the layout order? #989

Open vikas-singh16 opened 2 days ago

vikas-singh16 commented 2 days ago

Hi, first of all, excellent work. What you have built here is very helpful.

I want to understand how you organise the blocks (e.g. text, title, table) after determining their order on the page. If possible, could you explain briefly and point me to the relevant code?

Thank you

myhloli commented 1 day ago

If you have already obtained the correct reading order, then you just need to connect all the blocks in sequence. For the specific code, you can refer to: https://github.com/opendatalab/MinerU/blob/master/magic_pdf/pdf_parse_union_core_v2.py
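
As a rough illustration of "connecting the blocks in sequence", here is a minimal sketch. It is not MinerU's actual code: the block dictionaries and their `index`, `type`, and `text` fields, as well as the `blocks_to_markdown` helper, are hypothetical names chosen for this example. It assumes each block already carries its reading-order position.

```python
from typing import Dict, List


def blocks_to_markdown(blocks: List[Dict]) -> str:
    """Join page blocks into Markdown, following a precomputed reading order.

    Assumption: every block is a dict with a hypothetical "index" key
    (its reading-order position), a "type" key, and a "text" key.
    """
    # Sort by the reading-order position determined earlier.
    ordered = sorted(blocks, key=lambda b: b["index"])

    parts = []
    for block in ordered:
        text = block.get("text", "").strip()
        if not text:
            continue  # skip empty blocks
        if block["type"] == "title":
            parts.append(f"# {text}")  # render titles as headings
        else:
            parts.append(text)  # text, table, etc. are emitted as-is

    # Separate blocks with a blank line so each becomes its own paragraph.
    return "\n\n".join(parts)


blocks = [
    {"index": 2, "type": "text", "text": "Second paragraph."},
    {"index": 0, "type": "title", "text": "Introduction"},
    {"index": 1, "type": "text", "text": "First paragraph."},
]
markdown = blocks_to_markdown(blocks)
print(markdown)
```

In this sketch the only real work is the sort. Once the order is known, rendering is a single pass that formats each block by its type and joins the results.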

vikas-singh16 commented 13 hours ago

Thank you for the swift response.