WJLNTU commented 1 month ago

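Hello, could you share the evaluation datasets and their label categories? For example, which label categories were evaluated for Chinese papers? Are they consistent with the labels of the open-source models for Chinese documents? I'd like to know how you compared against those open-source models.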

wangbinDL commented 1 month ago

The README describes the categories used for layout detection:

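 0: 'title',              # title
 1: 'plain text',         # body text
 2: 'abandon',            # headers, footers, page numbers, and page annotations
 3: 'figure',             # figure
 4: 'figure_caption',     # figure caption
 5: 'table',              # table
 6: 'table_caption',      # table caption
 7: 'table_footnote',     # table footnote
 8: 'isolate_formula',    # display formula (the layout-level display formula; lower priority than 14)
 9: 'formula_caption',    # label (number) of a display formula

 13: 'inline_formula',    # inline formula
 14: 'isolated_formula',  # display formula
 15: 'ocr_text'           # OCR recognition result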

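The labels are the same for Chinese and English. We collect diverse Chinese and English documents and annotate them in detail, keeping the annotation guidelines consistent and accurate.
Because of data copyright issues, the evaluation data is used only for evaluation and cannot be released for now.
On the diverse document evaluation set, we run inference with other open-source models, align the output formats, and then evaluate. The problem with most current open-source models is a lack of diversity: they do well on a single type of document but noticeably worse once the scenario changes.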

WJLNTU commented 1 month ago

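OK, that makes sense. I'm a student doing similar research. The other models I've used actually work quite well, but a domain-specific model will naturally do best in its own domain and fall short when the metrics are compared on general scenarios. Strictly speaking, it would be great to have the concrete evaluation dataset and evaluation scripts, along with the comparison dimensions (model parameters, inference speed, whether the training labels are unified, etc.). Could you send them to me privately? Much appreciated : )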

ouyanglinke commented 1 month ago

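The evaluation relies on COCODetection from mmeval; you only need to convert the model results and the GT into the corresponding format and feed them to the metric.
The evaluation data is not public for now because its copyright status is still unclear. If the copyright issues are resolved later, we will consider releasing it. Please stay tuned, thanks!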
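The format conversion mentioned above can be sketched roughly as follows. This is a minimal illustration, assuming the standard COCO detection layout (boxes stored as `[x, y, width, height]`, one result record per predicted box) and input boxes given as corner coordinates `[x1, y1, x2, y2]`. The helper names `to_coco_gt` and `to_coco_results`, the input record layout, and the choice of category ids are hypothetical and not taken from this repo or from mmeval; check the mmeval documentation for the exact inputs its COCODetection metric expects.

```python
# Hypothetical class list; category ids are simply positions in this list.
LABEL_CLASSES = ["title", "plain text", "abandon", "figure",
                 "caption", "table", "isolate_formula"]
CAT_ID = {name: idx for idx, name in enumerate(LABEL_CLASSES)}


def xyxy_to_xywh(box):
    """Convert corner coordinates [x1, y1, x2, y2] to COCO [x, y, w, h]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def to_coco_gt(pages):
    """Build a COCO-style ground-truth dict from per-page annotations."""
    images, annotations = [], []
    for page in pages:
        images.append({"id": page["image_id"],
                       "width": page["width"],
                       "height": page["height"]})
        for label, box in page["boxes"]:
            bbox = xyxy_to_xywh(box)
            annotations.append({
                "id": len(annotations) + 1,      # unique annotation id
                "image_id": page["image_id"],
                "category_id": CAT_ID[label],
                "bbox": bbox,
                "area": bbox[2] * bbox[3],
                "iscrowd": 0,
            })
    categories = [{"id": idx, "name": name} for name, idx in CAT_ID.items()]
    return {"images": images, "annotations": annotations,
            "categories": categories}


def to_coco_results(predictions):
    """Build a COCO-style result list, one record per predicted box."""
    return [{"image_id": p["image_id"],
             "category_id": CAT_ID[p["label"]],
             "bbox": xyxy_to_xywh(p["box"]),
             "score": p["score"]}
            for p in predictions]


# Made-up example data, for illustration only.
pages = [{"image_id": 1, "width": 100, "height": 200,
          "boxes": [("title", [10, 20, 50, 40])]}]
preds = [{"image_id": 1, "label": "title",
          "box": [12, 18, 50, 42], "score": 0.9}]

gt = to_coco_gt(pages)
results = to_coco_results(preds)
```

Labels are assumed to have been renamed to the shared class names before this step, so that GT and predictions use the same `category_id` values.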
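We did not apply any special settings to the models' parameters; for evaluation we ran inference directly with the code provided in each model's repo. For details, see:
- Surya
- 360LayoutAnalysis
For the category labels, we aligned them through mappings during validation, as detailed below. (We will also add the following to the Validation section of this repo. Thanks for asking!)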

Surya
```
# Categories included in validation
label_classes = ["title", "plain text", "abandon", "figure", "caption", "table", "isolate_formula"] 
```

GT category mapping (the original categories are aligned with the LayoutLmv3-SFT model fine-tuned in this repo)

anno_class_change_dict = { 'formula_caption': 'caption', 'table_caption': 'caption', 'table_footnote': 'plain text' }

Surya category mapping

class_dict = { 'Caption': 'caption', 'Section-header' : 'title', 'Title': 'title', 'Figure': 'figure', 'Picture': 'figure', 'Footnote': 'abandon', 'Page-footer': 'abandon', 'Page-header': 'abandon', 'Table': 'table', 'Text': 'plain text', 'List-item': 'plain text', 'Formula': 'isolate_formula', }
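To illustrate how the three Surya settings above fit together, here is a minimal sketch. It assumes that each label is first renamed through its mapping and then dropped if the result is not in `label_classes`; that filtering rule, the helper `align_labels`, and the sample label lists are assumptions for illustration, not the repo's actual validation code.

```python
label_classes = ["title", "plain text", "abandon", "figure",
                 "caption", "table", "isolate_formula"]

# GT mapping, copied from the comment above.
anno_class_change_dict = {
    "formula_caption": "caption",
    "table_caption": "caption",
    "table_footnote": "plain text",
}

# Surya mapping, copied from the comment above.
class_dict = {
    "Caption": "caption", "Section-header": "title", "Title": "title",
    "Figure": "figure", "Picture": "figure", "Footnote": "abandon",
    "Page-footer": "abandon", "Page-header": "abandon", "Table": "table",
    "Text": "plain text", "List-item": "plain text",
    "Formula": "isolate_formula",
}


def align_labels(labels, mapping, keep):
    """Rename labels through `mapping`, then keep only those in `keep`.

    Dropping labels outside `keep` is an assumed rule, not confirmed here.
    """
    renamed = [mapping.get(label, label) for label in labels]
    return [label for label in renamed if label in keep]


# Made-up label lists, for illustration only.
gt_labels = ["title", "table_caption", "table_footnote", "inline_formula"]
pred_labels = ["Section-header", "Caption", "List-item", "Formula"]

gt_aligned = align_labels(gt_labels, anno_class_change_dict, label_classes)
pred_aligned = align_labels(pred_labels, class_dict, label_classes)

print(gt_aligned)    # ['title', 'caption', 'plain text']
print(pred_aligned)  # ['title', 'caption', 'plain text', 'isolate_formula']
```

Under this assumed rule, `inline_formula` is dropped from the GT because it has no mapping and is not one of the validated classes.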


# 360LayoutAnalysis
## For Paper

Categories included in validation

label_classes = ["title", "plain text", "abandon", "figure", "figure_caption", "table", "table_caption", "isolate_formula"]

GT category mapping table

anno_class_change_dict = { 'formula_caption': 'plain text', 'table_footnote': 'plain text' }

360LayoutAnalysis category mapping

class_change_dict = { 'Text': 'plain text',
'Title': 'title', 'Figure': 'figure', 'Figure caption': 'figure_caption',
'Table': 'table',
'Table caption': 'table_caption',
'Header': 'abandon', 'Footer': 'abandon',
'Reference': 'plain text',
'Equation': 'isolate_formula', 'Toc': 'plain text'
}


## For Report

Categories included in validation

label_classes = ["title", "plain text", "abandon", "figure", "figure_caption", "table", "table_caption"]

GT category mapping table

anno_class_change_dict = { 'formula_caption': 'plain text', 'table_footnote': 'plain text', 'isolate_formula': 'plain text', }

360LayoutAnalysis category mapping

class_change_dict = { 'Text': 'plain text',
'Title': 'title', 'Figure': 'figure', 'Figure caption': 'figure_caption',
'Table': 'table',
'Table caption': 'table_caption',
'Header': 'abandon', 'Footer': 'abandon',
'Reference': 'plain text',
'Equation': 'isolate_formula', 'Toc': 'plain text'
}
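When reusing mapping tables like these, one check that may help is confirming that every mapped target is among the validated classes. The sketch below is only an illustration: `unmapped_targets` is a hypothetical helper, and how the actual validation code treats labels outside `label_classes` is not stated in this thread.

```python
def unmapped_targets(mapping, label_classes):
    """Return the entries of `mapping` whose target is not in `label_classes`."""
    allowed = set(label_classes)
    return {src: dst for src, dst in mapping.items() if dst not in allowed}


# Validated classes for the two 360LayoutAnalysis settings, copied from above.
paper_classes = ["title", "plain text", "abandon", "figure", "figure_caption",
                 "table", "table_caption", "isolate_formula"]
report_classes = ["title", "plain text", "abandon", "figure", "figure_caption",
                  "table", "table_caption"]

# 360LayoutAnalysis mapping, identical in both settings above.
class_change_dict = {
    "Text": "plain text", "Title": "title", "Figure": "figure",
    "Figure caption": "figure_caption", "Table": "table",
    "Table caption": "table_caption", "Header": "abandon",
    "Footer": "abandon", "Reference": "plain text",
    "Equation": "isolate_formula", "Toc": "plain text",
}

paper_gaps = unmapped_targets(class_change_dict, paper_classes)
report_gaps = unmapped_targets(class_change_dict, report_classes)

print(paper_gaps)   # {}
print(report_gaps)  # {'Equation': 'isolate_formula'}
```

With the Report settings, `Equation` maps to `isolate_formula`, which is not in `label_classes`, so such predictions would presumably be excluded from scoring or need an extra rule. With the Paper settings, every target is covered.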

ouyanglinke commented 1 month ago

The validation set and validation code are expected to be released by the end of August. Please stay tuned, thanks!

opendatalab / PDF-Extract-Kit

About evaluation and the evaluation dataset #19
