Open wanghaisheng opened 6 years ago
www.icst.pku.edu.cn/cpdp/data/marmot_data.htm
Dataset for table recognition
In total, 2000 pages in PDF format were collected and the corresponding ground-truths were extracted utilizing our semi-automatic ground-truthing tool "Marmot". The dataset is composed of Chinese and English pages at the proportion of about 1:1.
The Chinese pages were selected from over 120 e-Books with diverse subject areas provided by the Founder Apabi library, and no more than 15 pages were selected from each book.
The English pages were crawled from the Citeseer website.
The pages show a great variety in language type, page layout, and table styles. Among them, over 1500 conference and journal papers were crawled, covering various fields and spanning from 1970 to the latest publications in 2011. The e-Book pages are mostly in one-column layout, while the English pages are mixed with both one-column and two-column layouts.
Open Images dataset & challenge:
https://github.com/cs-chan/Total-Text-Dataset
Total Text Dataset - ICDAR 2017. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
CTW dataset: https://ctwdataset.github.io/
In this paper we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters annotated by experts in over 30 thousand street view images. This is a challenging dataset with good diversity. It contains planar text, raised text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, etc. For each character in the dataset, the annotation includes its underlying character, its bounding box, and 6 attributes. The attributes indicate whether it has complex background, whether it is raised, whether it is handwritten or printed, etc.
32,285 high resolution images
1,018,402 character instances
3,850 character categories
6 kinds of attributes
http://rrc.cvc.uab.es/?com=introduction
"Robust Reading" refers to the research area dealing with the interpretation of written communication in unconstrained settings. Typically Robust Reading is linked to the detection and recognition of textual information in scene images, but in the wider sense it refers to techniques and methodologies that have been developed specifically for text containers other than scanned paper documents, and include born-digital images and videos to mention a few.
Robust Reading is at the meeting point between camera based document analysis and scene interpretation, and serves as common ground between the document analysis community and the wider computer vision community.
The ICDAR Robust Reading Competition has been held five times [1-5], in 2003, 2005, 2011, 2013 and 2015. The competition is organized around challenges that represent specific application domains for robust reading. Challenges are selected to cover a wide range of real-world situations. Each challenge is set up around different tasks.
ICDAR2017
The Text Recognition Algorithm Independent Evaluation (TRAIT) https://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8199.pdf
Link: https://pan.baidu.com/s/12Wstdz_u8iwr7NEJGQtnZg Password: 7p2m. HWDB2.2 handwriting data in VOC format; help yourself if you need it.
In the Marmot dataset, the table bounding boxes (BBOX) do not match the original images.
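I'd like to ask: is there a Chinese or English text-line dataset? Something like the synthetically generated one used by caffe-ocr.
@cloudfool Most people build their own data by adapting existing generation tools to the scenario they actually need to handle. For real-world scenes there are quite a few English datasets but not many Chinese ones, although you can synthesize data from other sources (for example, if you are processing paper-style documents).
Which open-source English text-line datasets are there? I have looked at many, but they are all word-level (ICDAR, for example); what I want is sentence-level.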
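@cloudfool Have you gone through everything I listed above? https://github.com/NVlabs/ocroseg/tree/master/testdata As for sentence level, what kind of sentences do you need? You can generate them directly from the txt files of Project Gutenberg e-books (novels, poetry, and so on), using numpy and similar tools.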
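A minimal sketch of the text-preparation half of that idea, assuming you already have the plain text of an e-book as a string: it cuts the text into sentence-level lines and pairs each line with an image filename, so that a rendering tool of your choice can draw each label into an image. The function names, the `max_chars` limit, and the filename pattern are all hypothetical, and the sentence splitting is deliberately naive.

```python
import re
import textwrap


def text_to_line_labels(raw_text, max_chars=60):
    """Split raw text into sentence-level lines of at most max_chars characters."""
    # Collapse newlines, tabs and repeated spaces into single spaces.
    flat = " ".join(raw_text.split())
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    lines = []
    for sentence in sentences:
        if sentence:
            # Long sentences are wrapped at word boundaries into several lines.
            lines.extend(textwrap.wrap(sentence, width=max_chars))
    return lines


def build_manifest(lines, prefix="line"):
    """Pair each text line with a (hypothetical) image filename to render it into."""
    return [(f"{prefix}_{i:06d}.png", label) for i, label in enumerate(lines)]
```

Each `(filename, label)` pair can then be handed to whatever renderer you use (fonts, noise, and distortion are left out here), which gives sentence-level ground truth for free because the label is known before the image exists.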
@wanghaisheng Hello, I sent an email to the 163 mailbox shown on your GitHub profile. I need your help, brother!
@mttbx I can't find the original files anymore.
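Link: https://pan.baidu.com/s/12Wstdz_u8iwr7NEJGQtnZg Password: 7p2m. HWDB2.2 handwriting data in VOC format; help yourself if you need it.
Brother, the link has expired!
@LinnaWang76 Sorry, I have forgotten the file name, so I cannot find the file in my Baidu Pan to share it again.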
In the Marmot dataset, the table bounding boxes (BBOX) do not match the original images.
I am facing the same issue. Did you ever find a solution?
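One possible cause worth checking, offered as an assumption rather than a confirmed answer: if the annotation stores each box as four hex-encoded big-endian doubles in PDF point coordinates with a bottom-left origin, the values have to be decoded, flipped vertically, and scaled to the rendered image size before they line up. A minimal sketch under exactly those assumptions (the function names are hypothetical, and the page is assumed to start at coordinate (0, 0)):

```python
import struct


def decode_hex_doubles(bbox_attr):
    """Decode space-separated 16-digit hex tokens as big-endian doubles (assumed encoding)."""
    return [struct.unpack(">d", bytes.fromhex(tok))[0] for tok in bbox_attr.split()]


def pdf_bbox_to_pixels(bbox_pts, page_w_pts, page_h_pts, img_w, img_h):
    """Map a box in PDF points (origin bottom-left) to pixels (origin top-left).

    Returns (left, top, right, bottom) in pixel coordinates.
    """
    xa, ya, xb, yb = bbox_pts
    # Scale factors from PDF points to image pixels.
    sx = img_w / page_w_pts
    sy = img_h / page_h_pts
    # Sort so the result does not depend on the corner order in the file.
    left, right = sorted((xa, xb))
    bottom, top = sorted((ya, yb))
    # Flip the y-axis: PDF y grows upward, image y grows downward.
    return (left * sx, (page_h_pts - top) * sy, right * sx, (page_h_pts - bottom) * sy)
```

If the boxes still look shifted after this, the page's crop box offset or a different rendering resolution would be the next things to compare against.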
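Medical records
* A dataset collected from mutual-aid platforms for evaluating text localization and recognition in mobile phone photos: https://github.com/wanghaisheng/huzhucases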