Closed by zuphilip 7 years ago
The raw Google-1000-Books dataset is still available from Google, although it's a bit cumbersome to get to:
wget http://commondatastorage.googleapis.com/books/icdar2007/README.txt
wget http://commondatastorage.googleapis.com/books/icdar2007/VERSION.txt
for i in $(seq -f "%04g" 0 9999); do
    wget -c "http://commondatastorage.googleapis.com/books/icdar2007/Volume_$i.zip"
done
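If you only want to look at a few volumes first, something like the following might be handier than the full loop. It is just a sketch: the base URL is the one from the commands above, but `volume_url` and `fetch_volume` are helper names I made up, and I am assuming `wget` and `unzip` are installed.

```shell
# Sketch only: BASE_URL is copied from the wget commands above;
# volume_url and fetch_volume are hypothetical helper names.
BASE_URL="http://commondatastorage.googleapis.com/books/icdar2007"

# Print the download URL for a volume index given as a plain integer
# (7 -> .../Volume_0007.zip). Do not pass zero-padded numbers here,
# printf would read them as octal.
volume_url() {
    printf '%s/Volume_%04d.zip\n' "$BASE_URL" "$1"
}

# Fetch one volume into the current directory, resuming a partial download.
# Returns non-zero if the download fails or the archive fails the zip test.
# The body runs in a subshell so the url variable stays local.
fetch_volume() (
    url="$(volume_url "$1")"
    wget -c -q "$url" && unzip -tq "$(basename "$url")" > /dev/null
)

# Usage (not run automatically): try the first ten volumes, note failures.
# for i in $(seq 0 9); do
#     fetch_volume "$i" || echo "volume $i failed" >> failed.txt
# done
```

Testing the archive with `unzip -tq` right after the download should catch truncated files early, before anything is unpacked for inspection.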
I don't know if the data is really suitable as ground truth, though. Looking at a few volumes, it looks like it comes straight out of Google's OCR system (which as of 2011 was likely Tesseract), with a lot of errors.
UNLV ISRI Document Collection for Research in OCR and Information Retrieval: https://code.google.com/archive/p/isri-ocr-evaluation-tools/downloads
High Performance OCR for Printed English and Fraktur using LSTM Networks (2013), Thomas M. Breuel, Adnan Ul-Hasan, Mayce Al Azawi, Faisal Shafait:
For English input, we used the University of Washington (UW3) dataset, representing 1600 pages of document images from scientific journals and other common sources. Text line images and corresponding ground-truth text were extracted from the data set using the layout ground-truth and transcriptions provided with UW3. Text lines containing mathematical equations were not used during either training or testing. Overall, we used a random subset of 95,338 text-lines in the training set and 1,020 text lines in the test set.
Fraktur
The training set was a fairly small set of about 20,000 text lines of mostly artificially generated characters.
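The paper does not say how the random train/test subsets were drawn. If someone wants to build a comparable split from their own list of text-line files, a rough sketch could look like this. The helper name `split_lines` and the file names in the usage comment are made up, and `shuf` is assumed to be available (GNU coreutils).

```shell
# Sketch: shuffle a list of text-line files and hold out N_TEST of them.
# split_lines is a hypothetical helper; shuf comes from GNU coreutils.
# Usage: split_lines LIST N_TEST TRAIN_OUT TEST_OUT
# The body runs in a subshell so the temporary variable stays local.
split_lines() (
    shuffled="$(mktemp)" || exit 1
    shuf "$1" > "$shuffled"
    head -n "$2" "$shuffled" > "$4"               # first N_TEST lines -> test set
    tail -n "+$(( $2 + 1 ))" "$shuffled" > "$3"   # remaining lines -> training set
    rm -f "$shuffled"
)

# Example (file names are placeholders):
# split_lines all-lines.txt 1020 train.txt test.txt
```

Every input line ends up in exactly one of the two output files, so the sets do not overlap. The split changes on every run unless a fixed random source is passed to `shuf`.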
Some of the ground truth for the ocropus models can be found at http://www.tmbdev.net/ocrdata-hdf5/, but I don't know how complete it is. For example, the Google-1000-Books data seems to be missing. However, the MNIST data is there, which is just (handwritten) digits. I found this (again) while looking at the IPython notebook https://github.com/tmbdev/clstm/blob/master/misc/lstm-mnist-py.ipynb and remembered that we had talked about it. (This might be only partially related to something specific in kraken, but my email bounced back.)