wanghaisheng / awesome-ocr

A curated list of promising OCR resources
http://wanghaisheng.github.io/ocr-arxiv-daily/
MIT License

A learning-based approach to text image retrieval: using CNN features and improved similarity metrics #27

Closed: wanghaisheng closed this issue 6 years ago

wanghaisheng commented 7 years ago

Original paper link | Paper download

wanghaisheng commented 7 years ago

Abstract

The same text content can be presented visually in different ways while the characters remain roughly similar. Conventional text image retrieval depends on a complex pipeline of OCR-based text recognition and text similarity detection. This paper instead proposes a learning-based approach to text image retrieval, whose purpose is to find the original or similar text given a query text image. First, a CNN extracts features from the text images to obtain deep visual representations. Then PCA reduces the dimensionality of the CNN features to make similarity detection more efficient. On top of that, an improved similarity metric with article theme relevance filtering is proposed to improve retrieval accuracy. In the experiments, we collect a set of academic papers in both English and Chinese as the text database and cut them into text image pieces. A text image with altered text content is used as the query image, and the experimental results show that the proposed approach has a good ability to retrieve the original text content.
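The pipeline described in the abstract (feature extraction, PCA reduction, filtered similarity ranking) can be sketched roughly as follows. This is only a minimal illustration, not the paper's implementation: the CNN features are replaced by random vectors, the similarity is plain cosine similarity rather than the paper's improved metric, and the `themes` labels and `allowed` mask are hypothetical stand-ins for the theme relevance filtering, whose details the abstract does not give.

```python
import numpy as np


def fit_pca(features, n_components):
    """Return (mean, components) for projecting onto the top principal axes."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Project feature rows into the reduced PCA space."""
    return (features - mean) @ components.T


def cosine_rank(query, database, allowed=None):
    """Rank database rows by cosine similarity to the query, best first."""
    q = query / (np.linalg.norm(query) + 1e-12)
    d = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-12)
    scores = d @ q
    if allowed is not None:
        # Hypothetical theme filter: excluded items get the lowest score.
        scores = np.where(allowed, scores, -np.inf)
    return np.argsort(-scores), scores


rng = np.random.default_rng(0)
db_features = rng.normal(size=(200, 512))  # assumption: stand-in for CNN features
themes = rng.integers(0, 4, size=200)      # assumption: hypothetical theme labels

# Query: a slightly perturbed copy of database item 42,
# mimicking a text image whose content was changed a little.
query_feature = db_features[42] + 0.05 * rng.normal(size=512)

mean, comps = fit_pca(db_features, n_components=32)
db_low = project(db_features, mean, comps)
q_low = project(query_feature[None, :], mean, comps)[0]

# Only compare against items that share the query's (hypothetical) theme.
ranking, scores = cosine_rank(q_low, db_low, allowed=(themes == themes[42]))
top_match = int(ranking[0])
print("best match:", top_match)
```

In a real setup, `db_features` would come from a pretrained CNN applied to the cropped text images, and the number of PCA components would be tuned against retrieval accuracy and speed.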

wanghaisheng commented 7 years ago

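This looks like image-to-image retrieval, except that the images contain pure text. It would be useful for scenarios such as plagiarism checking of papers.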

wanghaisheng commented 7 years ago

https://arxiv.org/pdf/1703.06618.pdf Twitter100k: A Real-world Dataset for Weakly Supervised Cross-Media Retrieval. Another paper that is likewise about finding identical text.