Datasets for benchmark testing

wanghaisheng commented 6 years ago

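Medical records

* A dataset collected from mutual-aid platforms for evaluating text localization and recognition in photos taken with mobile phones: https://github.com/wanghaisheng/huzhucases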

wanghaisheng commented 6 years ago

www.icst.pku.edu.cn/cpdp/data/marmot_data.htm

Dataset for table recognition

In total, 2000 pages in PDF format were collected and the corresponding ground-truths were extracted utilizing our semi-automatic ground-truthing tool "Marmot". The dataset is composed of Chinese and English pages at the proportion of about 1:1.

The Chinese pages were selected from over 120 e-Books with diverse subject areas provided by Founder Apabi library, and no more than 15 pages were selected from each book.
The English pages were crawled from Citeseer website.

The pages show a great variety in language type, page layout, and table styles. Among them, over 1500 conference and journal papers were crawled, covering various fields and spanning from 1970 to the latest publications in 2011. The e-Book pages are mostly in one-column layout, while the English pages mix one-column and two-column layouts.

wanghaisheng commented 6 years ago

Open Images dataset and challenge:

https://storage.googleapis.com/openimages/web/index.html

wanghaisheng commented 6 years ago

https://github.com/cs-chan/Total-Text-Dataset

Total Text Dataset - ICDAR 2017. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.

wanghaisheng commented 6 years ago

CTW dataset: https://ctwdataset.github.io/

In this paper we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters annotated by experts in over 30 thousand street view images. This is a challenging dataset with good diversity. It contains planar text, raised text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, etc. For each character in the dataset, the annotation includes its underlying character, its bounding box, and 6 attributes. The attributes indicate whether it has a complex background, whether it is raised, whether it is handwritten or printed, etc.

32,285 high resolution images
1,018,402 character instances
3,850 character categories
6 kinds of attributes
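
The per-character annotation described above (underlying character, bounding box, attributes) could be held in memory roughly as follows. This is only a minimal sketch and not the official CTW file format: the class name `CharAnnotation`, the `(x, y, width, height)` box layout, the attribute name strings, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class CharAnnotation:
    """Hypothetical record for one annotated character (not the official schema)."""
    text: str                                  # the underlying character
    bbox: Tuple[float, float, float, float]    # assumed layout: (x, y, width, height)
    attributes: Set[str] = field(default_factory=set)  # names of attributes that apply


def count_with_attribute(annotations: Iterable[CharAnnotation], name: str) -> int:
    """Count the characters that carry the given attribute name."""
    return sum(1 for a in annotations if name in a.attributes)


# Made-up sample records, for illustration only.
sample = [
    CharAnnotation("店", (10.0, 20.0, 32.0, 32.0), {"raised"}),
    CharAnnotation("中", (50.0, 20.0, 30.0, 31.0), {"handwritten", "complex_background"}),
    CharAnnotation("国", (90.0, 22.0, 30.0, 30.0)),
]

raised_count = count_with_attribute(sample, "raised")
```

Storing the attributes as a set of names keeps the sketch independent of the exact list of six attributes, which the description above only partly enumerates.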

wanghaisheng commented 6 years ago

http://rrc.cvc.uab.es/?com=introduction

"Robust Reading" refers to the research area dealing with the interpretation of written communication in unconstrained settings. Typically Robust Reading is linked to the detection and recognition of textual information in scene images, but in the wider sense it refers to techniques and methodologies that have been developed specifically for text containers other than scanned paper documents, and include born-digital images and videos to mention a few.

Robust Reading is at the meeting point between camera based document analysis and scene interpretation, and serves as common ground between the document analysis community and the wider computer vision community.

The ICDAR Robust Reading Competition has been held five times [1-5], in 2003, 2005, 2011, 2013 and 2015. The competition is organized around challenges that represent specific application domains for robust reading. Challenges are selected to cover a wide range of real-world situations. Each challenge is set up around different tasks.

ICDAR2017