[2022] TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

osuossu8 commented 1 year ago

In the first stage, we synthesize a large-scale dataset consisting of hundreds of millions of printed textline images and pre-train the TrOCR models on that.
In the second stage, we build two relatively small datasets corresponding to printed and handwritten downstream tasks, containing millions of textline images each.
- 1st stage ... hundreds of millions of synthetic text-line images
- 2nd stage ... millions of text-line images each for the printed and handwritten tasks

osuossu8 commented 1 year ago

Pre-training Dataset

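stage 1

Sampled two million document pages from PDF files publicly available on the Internet.
Converted them to page images and extracted the text lines as cropped images.
The first-stage pre-training dataset contains 684 million text lines.

stage 2

Handwritten text recognition
- 17.9 million text lines, including the IIIT-HWS dataset (Krishnan and Jawahar 2016).
- Synthesized with 5,427 handwritten fonts
  - https://fonts.google.com/
  - https://www.1001fonts.com/handwritten-fonts.html
- Generated with a text image generation engine (TRDG)
  - https://github.com/Belval/TextRecognitionDataGenerator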
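Printed text recognition
- Collected around 53,000 real-world receipt images and recognized their text with commercial OCR engines.
- Based on the OCR results, cropped the text lines by their coordinates and rectified them into normalized images.
- Used TRDG (TextRecognitionDataGenerator) to synthesize 1 million printed text-line images with two receipt fonts and the built-in printed fonts.
- 3.3 million text lines in total.
Scene text recognition
- MJSynth (MJ) (Jaderberg et al. 2014) and SynthText (ST) (Gupta, Vedaldi, and Zisserman 2016).
- About 16 million text images in total.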
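
To get a feel for how small the second-stage datasets are next to the first-stage one, the line counts noted above can be compared directly. This is only a rough sketch: the counts are the rounded figures from these notes, the names `STAGE1_LINES`, `STAGE2_LINES` and `stage2_share` are made up for illustration, and the "non-synthesized printed lines" figure is an inference (total minus the 1M TRDG lines), not a number quoted from the paper.

```python
# Rounded text-line counts taken from the notes above.
STAGE1_LINES = 684_000_000

STAGE2_LINES = {
    "handwritten": 17_900_000,
    "printed": 3_300_000,
    "scene_text": 16_000_000,
}

# Printed lines synthesized with TRDG, as stated in the notes.
PRINTED_SYNTHESIZED = 1_000_000


def stage2_share(stage2, stage1):
    """Return each stage-2 dataset's size as a fraction of the stage-1 dataset."""
    return {name: count / stage1 for name, count in stage2.items()}


def printed_not_synthesized(stage2, synthesized):
    """Printed lines left after subtracting the TRDG-synthesized ones.

    Assumption: the remainder comes from the real receipt images;
    the notes do not state this number explicitly.
    """
    return stage2["printed"] - synthesized


if __name__ == "__main__":
    for name, share in stage2_share(STAGE2_LINES, STAGE1_LINES).items():
        print(f"{name}: {STAGE2_LINES[name]:,} lines ({share:.2%} of stage 1)")
    remainder = printed_not_synthesized(STAGE2_LINES, PRINTED_SYNTHESIZED)
    print(f"printed lines not synthesized (inferred): {remainder:,}")
```

Each second-stage dataset is under 3% of the first-stage data, which matches the "relatively small" wording in the quoted abstract.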

osuossu8 commented 1 year ago

Benchmark

The SROIE (Scanned Receipts OCR and Information Extraction) dataset (Task 2) focuses on text recognition in receipt images.
- train: 626 receipt images
- test: 361 receipt images
- Cropped text-line images (rectangles) are used for evaluation.
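
Cropping text lines by rectangle could look roughly like the sketch below. It assumes each annotation line has the form `x1,y1,x2,y2,x3,y3,x4,y4,text` (four corner points followed by the transcription), which is how I understand the SROIE annotation files; check this against the actual data. `TextLine`, `parse_annotation_line` and `crop` are illustrative names, and the image is a plain list of pixel rows so the example stays dependency-free.

```python
from typing import NamedTuple


class TextLine(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int
    text: str


def parse_annotation_line(line):
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,text' into an axis-aligned rectangle.

    The format is an assumption about the annotation files, not taken from the notes.
    """
    # Split on the first 8 commas only: the transcription itself may contain commas.
    parts = line.rstrip("\n").split(",", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 8 coordinates and a text field: {line!r}")
    coords = [int(p) for p in parts[:8]]
    xs, ys = coords[0::2], coords[1::2]
    return TextLine(min(xs), min(ys), max(xs), max(ys), parts[8])


def crop(image, box):
    """Crop a row-major image (list of pixel rows) to the rectangle in `box`."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


if __name__ == "__main__":
    box = parse_annotation_line("10,20,110,20,110,40,10,40,TOTAL: 1,234.00")
    blank_page = [[0] * 200 for _ in range(50)]  # 50 rows x 200 columns
    patch = crop(blank_page, box)
    print(box.text, len(patch), len(patch[0]))
```

With a real image library the same rectangle would be passed to its crop call instead of slicing lists.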
The IAM Handwriting Database
- We use Aachen's partition of the dataset: 6,161 lines from 747 forms in the train set, 966 lines from 115 forms in the validation set and 2,915 lines from 336 forms in the test set.
Scene text recognition
- Widely used benchmarks: IIIT5K-3000 (Mishra, Alahari, and Jawahar 2012), SVT-647 (Wang, Babenko, and Belongie 2011), IC13-857, IC13-1015 (Karatzas et al. 2013), IC15-1811, IC15-2077 (Karatzas et al. 2015), SVTP-645 (Phan et al. 2013), and CT80-288
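
The benchmark names carry a numeric suffix, which by the usual convention in scene text recognition is the number of test images (so IC13 and IC15 each appear in two variants). A small sketch for splitting the names, with `BENCHMARKS`, `split_name` and `group_sizes` as illustrative names:

```python
# Benchmark identifiers exactly as listed in the notes above.
BENCHMARKS = [
    "IIIT5K-3000", "SVT-647", "IC13-857", "IC13-1015",
    "IC15-1811", "IC15-2077", "SVTP-645", "CT80-288",
]


def split_name(benchmark):
    """Split 'NAME-SIZE' into (name, size) at the last hyphen."""
    name, _, size = benchmark.rpartition("-")
    return name, int(size)


def group_sizes(benchmarks):
    """Map each dataset name to the list of test-set sizes it appears with."""
    grouped = {}
    for benchmark in benchmarks:
        name, size = split_name(benchmark)
        grouped.setdefault(name, []).append(size)
    return grouped


if __name__ == "__main__":
    for name, sizes in group_sizes(BENCHMARKS).items():
        print(name, sizes)
```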

osuossu8 commented 1 year ago

Pre-training uses 32 V100 GPUs with 32 GB of memory each; fine-tuning uses 8 V100 GPUs.
The batch size is set to 2,048 and the learning rate is 5e-5.
The DeiT and BEiT encoders use a 384×384 input resolution and a 16×16 patch size.
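
The resolution and patch size fix the encoder's sequence length, which can be worked out as below. The patch arithmetic follows directly from the numbers above; the 3-channel input and the even, accumulation-free split of the batch across GPUs are my assumptions, and the function names are illustrative.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side


def flattened_patch_dim(patch_size, channels=3):
    """Length of one flattened patch; channels=3 assumes RGB input."""
    return patch_size * patch_size * channels


def per_gpu_batch(global_batch, gpus):
    """Samples per GPU, assuming an even split and no gradient accumulation."""
    return global_batch // gpus


if __name__ == "__main__":
    print(num_patches(384, 16))       # 576 patches (24 x 24)
    print(flattened_patch_dim(16))    # 768 values per patch
    print(per_gpu_batch(2048, 32))    # 64 samples per GPU in pre-training
```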
