wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/
MIT License
1.66k stars 351 forks

Datasets for benchmarking detection #93

Open wanghaisheng opened 6 years ago

wanghaisheng commented 6 years ago

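Medical records

* A dataset collected from mutual-aid platforms for evaluating text localization and recognition in phone-camera photos: https://github.com/wanghaisheng/huzhucases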

wanghaisheng commented 6 years ago

www.icst.pku.edu.cn/cpdp/data/marmot_data.htm

Dataset for table recognition. In total, 2000 pages in PDF format were collected, and the corresponding ground truths were extracted using our semi-automatic ground-truthing tool "Marmot". The dataset is composed of Chinese and English pages in a proportion of about 1:1.

The Chinese pages were selected from over 120 e-Books with diverse subject areas provided by Founder Apabi library, and no more than 15 pages were selected from each book.
The English pages were crawled from Citeseer website.

The pages show great variety in language type, page layout, and table style. Among them, over 1500 conference and journal papers were crawled, covering various fields and spanning from 1970 to the latest 2011 publications. The e-Book pages are mostly in one-column layout, while the English pages mix one-column and two-column layouts.

wanghaisheng commented 6 years ago

Open Images dataset and challenge:

https://storage.googleapis.com/openimages/web/index.html

wanghaisheng commented 6 years ago

https://github.com/cs-chan/Total-Text-Dataset

Total-Text Dataset - ICDAR 2017. It consists of 1555 images with more than 3 different text orientations (horizontal, multi-oriented, and curved), making it one of a kind.

wanghaisheng commented 6 years ago

CTW dataset: https://ctwdataset.github.io/

In this paper we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters annotated by experts in over 30 thousand street view images. This is a challenging dataset with good diversity. It contains planar text, raised text, text in cities, text in rural areas, text under poor illumination, distant text, partially occluded text, etc. For each character in the dataset, the annotation includes its underlying character, its bounding box, and 6 attributes. The attributes indicate whether it has a complex background, whether it is raised, whether it is handwritten or printed, etc.

32,285 high resolution images
1,018,402 character instances
3,850 character categories
6 kinds of attributes
wanghaisheng commented 6 years ago

http://rrc.cvc.uab.es/?com=introduction "Robust Reading" refers to the research area dealing with the interpretation of written communication in unconstrained settings. Typically Robust Reading is linked to the detection and recognition of textual information in scene images, but in the wider sense it refers to techniques and methodologies that have been developed specifically for text containers other than scanned paper documents, and include born-digital images and videos to mention a few.

Robust Reading is at the meeting point between camera based document analysis and scene interpretation, and serves as common ground between the document analysis community and the wider computer vision community.

The ICDAR Robust Reading Competition has been held five times [1-5], in 2003, 2005, 2011, 2013 and 2015. The competition is organized around challenges that represent specific application domains for robust reading. Challenges are selected to cover a wide range of real-world situations. Each challenge is set up around different tasks.

ICDAR2017

wanghaisheng commented 6 years ago

The Text Recognition Algorithm Independent Evaluation (TRAIT) https://nvlpubs.nist.gov/nistpubs/ir/2017/NIST.IR.8199.pdf

wanghaisheng commented 6 years ago

Link: https://pan.baidu.com/s/12Wstdz_u8iwr7NEJGQtnZg Password: 7p2m. HWDB2.2 handwriting data in VOC format; help yourself if you need it.

mvprasad58 commented 5 years ago

In the Marmot dataset, the table bounding boxes do not match the original images.

cloudfool commented 5 years ago

I'd like to ask: are there any Chinese or English text-line datasets? Something like the synthetic ones from caffe-ocr.

wanghaisheng commented 5 years ago

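@cloudfool Everyone builds their own, adapting existing generation tools to the scenario they actually work on. For real-world scenes there are quite a few English datasets but few Chinese ones, though you can synthesize data from other sources (for example, if what you process is paper-type documents).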

cloudfool commented 5 years ago

Which English text-line datasets are open source? I've searched a lot, but they are all word-level (ICDAR, for example). What I want is sentence-level.

wanghaisheng commented 5 years ago

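@cloudfool Have you gone through everything I listed above? https://github.com/NVlabs/ocroseg/tree/master/testdata is sentence-level. What kind of sentences do you need? You can generate them directly from Project Gutenberg e-books (novels, poetry and so on, as txt files) using numpy and similar tools.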
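A minimal sketch of the first step of that suggestion, assuming you already have a plain-text book loaded as a string: it cuts the text into sentence-level lines that can serve as labels for synthetic text-line images. The function name `sentence_lines`, the length limits, and the sample text are all made up for illustration. Rendering each line to an image is a separate step and is not shown here.

```python
import re


def sentence_lines(text, min_chars=20, max_chars=80):
    """Split plain text into sentence-level lines suitable as text-line labels.

    min_chars and max_chars are illustrative limits, not values from any dataset.
    """
    # Rejoin hard-wrapped lines by collapsing all whitespace to single spaces.
    flat = " ".join(text.split())
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", flat)
    # Keep only lines whose length fits a single rendered text line.
    return [p for p in parts if min_chars <= len(p) <= max_chars]


# Hypothetical sample standing in for the contents of a txt e-book.
sample = """The ship left the harbour at dawn,
and the gulls followed it out to sea. Yes. Nobody on the quay waved goodbye to the crew."""

lines = sentence_lines(sample)
for index, line in enumerate(lines):
    print(index, line)
```

The split rule is deliberately simple, so abbreviations such as "Mr." will produce bad cuts; a real pipeline would probably need a better sentence splitter.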

mttbx commented 5 years ago

@wanghaisheng Hi, I sent an email to the 163 address shown on your GitHub profile. I need your help, friend!

wanghaisheng commented 5 years ago

@mttbx I can no longer find the original file.

LinnaWang76 commented 4 years ago

Link: https://pan.baidu.com/s/12Wstdz_u8iwr7NEJGQtnZg Password: 7p2m. HWDB2.2 handwriting data in VOC format; help yourself if you need it.

Friend, the link has expired!

wanghaisheng commented 4 years ago

@LinnaWang76 Sorry, I have forgotten the file name, so I can't find the file in my Baidu Pan to share it again.

chixma commented 4 years ago

In the Marmot dataset, the table bounding boxes do not match the original images.

I am facing the same issue. Did you find a solution in the end?
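Nobody in this thread confirmed the cause, so the following is only a guess to check. It assumes two things about the Marmot ground truth: that each `BBox` value is a hex-encoded big-endian double, and that the coordinates are PDF points with the origin at the bottom-left of the page. If both hold, drawing the raw values on a rendered image will look wrong until they are decoded and flipped. The function names, the page height, and the DPI below are hypothetical.

```python
import struct


def decode_hex_double(token):
    """Decode a 16-character hex string as a big-endian IEEE 754 double.

    Assumption: the ground-truth coordinates are stored this way.
    """
    return struct.unpack(">d", bytes.fromhex(token))[0]


def pdf_box_to_pixels(box, page_height_pt, dpi=72.0):
    """Convert (x0, y0, x1, y1) in PDF points (origin bottom-left)
    to (left, top, right, bottom) in pixels (origin top-left)."""
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    x0, y0, x1, y1 = box
    left_pt, right_pt = sorted((x0, x1))
    bottom_pt, top_pt = sorted((y0, y1))
    # Flip the y axis: a large PDF y value is near the top of the image.
    return (
        left_pt * scale,
        (page_height_pt - top_pt) * scale,
        right_pt * scale,
        (page_height_pt - bottom_pt) * scale,
    )


# Hypothetical values: a US Letter page (792 pt high) rendered at 144 DPI.
width_pt = decode_hex_double("4059000000000000")
pixel_box = pdf_box_to_pixels((72.0, 100.0, 172.0, 300.0), page_height_pt=792.0, dpi=144.0)
print(width_pt, pixel_box)
```

If the boxes still look shifted after this conversion, the page crop box offset or the rendering DPI would be the next things to check.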