mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Some GT of the ocropus models found #21

zuphilip closed this issue 7 years ago

zuphilip commented 8 years ago

Some of the ground truth for the ocropus models can be found at http://www.tmbdev.net/ocrdata-hdf5/, but I don't know how complete it is; the Google-1000-Books data, for example, seems to be missing. There is also data from MNIST, which is just (handwritten) digits. I found this (again) by looking at the IPython notebook https://github.com/tmbdev/clstm/blob/master/misc/lstm-mnist-py.ipynb and remembered that we had talked about this. (This may only partially relate to anything specific in kraken, but my email bounced back.)
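For reference, the files there can be mirrored and inspected from the shell. This is a minimal sketch assuming the directory index is browsable and the archive consists of .h5 files; uw3-lines.h5 is a hypothetical filename, so take the real names from the listing:

# Mirror all HDF5 files from the listing (-r recursive, -np don't
# ascend to the parent directory, -nd don't recreate the directory tree).
wget -r -np -nd -A '*.h5' http://www.tmbdev.net/ocrdata-hdf5/

# h5dump ships with the standard HDF5 tools; -n lists the groups and
# datasets in a file without dumping the data itself.
h5dump -n uw3-lines.h5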

jbaiter commented 8 years ago

The raw Google-1000-Books dataset is still available from Google, although it's a bit cumbersome to get to:

wget http://commondatastorage.googleapis.com/books/icdar2007/README.txt
wget http://commondatastorage.googleapis.com/books/icdar2007/VERSION.txt
# The collection contains 1000 volumes, Volume_0000.zip through
# Volume_0999.zip; -c resumes partial downloads if the loop is interrupted.
for i in $(seq -f "%04g" 0 999); do
    wget -c "http://commondatastorage.googleapis.com/books/icdar2007/Volume_$i.zip"
done
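Since interrupted transfers are resumed with -c, it is worth checking afterwards that every archive came down intact; a small sketch:

# unzip -t tests an archive's integrity without extracting it.
for f in Volume_*.zip; do
    unzip -t "$f" > /dev/null || echo "corrupt or incomplete: $f"
done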

I don't know if the data is really suitable as ground truth, though. Judging from a few volumes, the text appears to come straight out of Google's OCR system (which as of 2011 was likely Tesseract) and contains a lot of errors.

amitdo commented 7 years ago

UNLV ISRI Document Collection for Research in OCR and Information Retrieval https://code.google.com/archive/p/isri-ocr-evaluation-tools/downloads
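The Google Code Archive serves old project downloads from a fixed URL pattern, so the files can be fetched non-interactively. A hedged sketch; the archive name below is a placeholder, so take the real filenames from the downloads page:

# Hypothetical archive name; substitute one listed on the downloads page.
ARCHIVE="isri-ocr-data.tgz"
wget "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/isri-ocr-evaluation-tools/$ARCHIVE"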

amitdo commented 7 years ago

High-Performance OCR for Printed English and Fraktur using LSTM Networks (2013), Thomas M. Breuel, Adnan Ul-Hasan, Mayce Al Azawi, Faisal Shafait

For English input, we used the University of Washington (UW3) dataset, representing 1600 pages of document images from scientific journals and other common sources. Text line images and corresponding ground-truth text were extracted from the data set using the layout ground-truth and transcriptions provided with UW3. Text lines containing mathematical equations were not used during either training or testing. Overall, we used a random subset of 95,338 text-lines in the training set and 1,020 text lines in the test set.
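A random split like the one described can be reproduced with standard shell tools. This is an illustrative sketch only: it assumes the extracted line images sit in a lines/ directory as .png files with matching ground-truth transcriptions (the usual ocropus layout), which is not something taken from the paper.

# Shuffle the full list of line images, hold out 1,020 for testing
# (matching the paper's test-set size), and train on the rest.
ls lines/*.png | shuf > all-lines.txt
head -n 1020 all-lines.txt > test-lines.txt
tail -n +1021 all-lines.txt > train-lines.txt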

Fraktur

The training set was fairly small: about 20,000 text lines, mostly artificially generated.
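The ocropy tools include a generator for such artificial training lines. A minimal sketch, assuming a plain-text corpus and a Fraktur TrueType font are at hand; both filenames here are placeholders:

# Renders each line of the text file in the given font, producing
# line images plus matching ground-truth transcriptions.
ocropus-linegen -t fraktur-corpus.txt -f UnifrakturMaguntia.ttf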