opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports extraction from PDFs, webpages, and multi-format e-books.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.19k stars 835 forks

feat<table model>: add tablemaster with paddleocr to detect and recognize table #493

Closed papayalove closed 2 weeks ago

papayalove commented 2 weeks ago

Motivation

To improve table detection and recognition, we adopt a new table structure model, TableMaster, together with PaddleOCR to process tables. TableMaster is much faster than StructEqTable.

Modification

We add an option for table conversion. Unlike StructEqTable, the new table model outputs its result as HTML code.

Use cases (Optional)

Set the value of "model" under "table-config" in magic-pdf.json to "TableMaster" to switch to the new model.

{
  // other config
  "models-dir": "D:/models",
  "table-config": {
    "model": "TableMaster", // the other option for this value is 'struct_eqtable'
    "is_table_recog_enable": false, // table recognition is disabled by default; set this to true to enable it
    "max_time": 400
  }
}
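
For anyone who prefers to flip this setting from a script rather than by hand, here is a minimal sketch. It assumes magic-pdf.json on disk is plain JSON (the `//` comments in the snippet above are explanatory only and would not parse). The helper names `set_table_model` and `update_config_file` are hypothetical and not part of MinerU.

```python
import copy
import json
from pathlib import Path


def set_table_model(config: dict, model: str = "TableMaster", enable: bool = True) -> dict:
    """Return a copy of `config` with the table model and its enable flag set.

    Hypothetical helper: only the keys shown in this PR description
    ("table-config", "model", "is_table_recog_enable") are touched.
    """
    updated = copy.deepcopy(config)
    # Create the "table-config" section if the file does not have one yet.
    table_cfg = updated.setdefault("table-config", {})
    table_cfg["model"] = model
    table_cfg["is_table_recog_enable"] = enable
    return updated


def update_config_file(path: Path, model: str = "TableMaster", enable: bool = True) -> None:
    """Rewrite the JSON config at `path` in place (assumes comment-free JSON)."""
    config = json.loads(path.read_text(encoding="utf-8"))
    updated = set_table_model(config, model, enable)
    path.write_text(json.dumps(updated, indent=2), encoding="utf-8")
```

Calling `update_config_file(Path("magic-pdf.json"))` with the path to your own config would switch to TableMaster and enable table recognition. Passing `model="struct_eqtable"` switches back. Other keys such as "models-dir" and "max_time" are left unchanged.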

Checklist

Before PR:

After PR: