ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

A Deep Study of Ocropy on Old Books Pages #316

Open PedroBarcha opened 5 years ago

PedroBarcha commented 5 years ago

For the last 2 years I have been studying Ocropy and its approach to the various OCR steps. In particular, I conducted my experiments on an old-books dataset that I created, available at: https://github.com/PedroBarcha/old-books-dataset (300 dpi version). For further reference, I will call it the Original Dataset.

I proposed several alternatives to Ocropy's original approaches and analyzed their results with the free software tool OCRevaluation, available at: https://github.com/impactcentre/ocrevalUAtion. My results are expressed as CER (Character Error Rate).
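For context, CER is essentially the character-level edit distance between the OCR output and the ground truth, divided by the length of the ground truth. A minimal sketch of the computation (this is not OCRevaluation's actual code, just an illustration of the metric):

```python
# Minimal illustration of CER: edit distance (insertions, deletions,
# substitutions) between OCR output and ground truth, divided by the
# number of characters in the ground truth.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[-1] + 1,               # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def cer(ocr_text, ground_truth):
    return levenshtein(ocr_text, ground_truth) / float(len(ground_truth))

print(cer("Tbe old hovse", "The old house"))  # 2 errors / 13 chars, roughly 0.15
```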

In the end, I was able to improve the CER by around 43%! With that in mind, I would like to share some particularly interesting results.

1- Binarization

This was the step in which I obtained the biggest improvement. I binarized the dataset with several thresholding techniques from Octave's image package (see: https://octave.sourceforge.io/image/function/graythresh.html) and then OCRed each binarized version of the set with Ocropy. The results are shown below:

Chart 1: the Character Error Rate for each binarized version of the dataset. The red one is the result of OCRing the original (non-binarized) set, and thus our baseline. The others represent the versions binarized by Octave.

1.1- Binarization of Pages with Pictures

Analyzing individually the pages binarized by Ocropy's original thresholding method, I realized that pages containing pictures do not have them turned into black blocks. These pictures are therefore later recognized as HUGE amounts of spurious characters (trash). (Sorry for the low-quality pics.)

Figure 1 (left): page binarized with Ocropy's original method. Figure 2 (right): page binarized with Otsu.

The Otsu and Intermeans methods turn the pictures in the pages into black boxes, which Ocropy later classifies as non-character blocks (and which are thus not incorrectly recognized as lots of random characters). Both methods binarize the pages based on a global histogram, unlike Ocropy's adaptive threshold; this explains the black-box results.
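For anyone who wants to try the global-threshold idea without Octave, a rough Python equivalent using scikit-image would look like this (purely illustrative: my experiments used Octave's graythresh, and page.png is just a placeholder name):

```python
# Rough Python sketch of global (Otsu) binarization; the study itself used
# Octave's image package, so this only illustrates the idea.
import numpy as np
from skimage.io import imread, imsave
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

page = imread("page.png")
if page.ndim == 3:
    page = rgb2gray(page)           # collapse color channels to grayscale
t = threshold_otsu(page)            # one global threshold for the whole page
binary = (page > t).astype(np.uint8) * 255
imsave("page.bin.png", binary)      # dark picture regions come out as solid black
```

Because the threshold is computed once for the whole page, the dark halftone regions of a picture fall below it as a block, which is why they end up as the black boxes shown in Figure 2.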

1.2- Binarization of Pages with No Pictures

Although not as significant as the situation described above, pages without pictures also had better results with the alternative binarization methods of Chart 1. Here, Minimum showed the best results.

2- Training

In this step, new models were trained on top of Ocropy's default one, in order to get better results for old-book pages. A new dataset was created for this purpose. The K-fold technique (with K=10) was used with the new dataset, divided into 60% training, 20% test and 20% cross-validation. In the first part, the number-of-iterations hyperparameter was tested. In the second part (which I will not describe here, because I could not get any improvement out of it), the learning rate was tested.
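To give an idea of how the training of one fold was driven, here is a simplified sketch, using the snapshot frequency and iteration budget described in section 2.1 below. The fold0/ paths are made up, and the flag spellings are taken from ocropus-rtrain's options as I remember them, so treat the exact command as an assumption rather than a verbatim copy of mine:

```python
# Sketch: continue training from Ocropy's default English model for one fold.
# Directory layout is hypothetical; line images are binarized .bin.png files
# with matching .gt.txt ground-truth transcriptions.
import glob
import subprocess

lines = sorted(glob.glob("fold0/train/*.bin.png"))
subprocess.call(["ocropus-rtrain",
                 "--load", "en-default.pyrnn.gz",   # start from the default model
                 "--savefreq", "500",               # record a snapshot every 500 iterations
                 "--ntrain", "30000",               # stop after 30000 iterations
                 "-o", "fold0/model"] + lines)
```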

2.1- Number of Iterations

30,000 iterations (for each fold) were performed, with a model snapshot being recorded every 500 iterations. I made a script that plots the average CER of the 10 folds for each model recorded. Furthermore, the script indicates the models that had the best results.

Figure 3: average Character Error Rate (of the 10 folds), for the training set and for the test set. Y-axis: character error rate. X-axis: number of iterations.

(For some reason, there was a peak around 22,000 iterations for every fold. Does anyone know the reason?)
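Regarding the averaging/plotting script mentioned above, it boils down to something like this (heavily simplified sketch; in the real script the per-fold CER values are read from the OCRevaluation reports, here they are just a placeholder array):

```python
# Simplified sketch: for every recorded checkpoint (every 500 iterations),
# average the CER of the 10 folds and mark the checkpoint with the lowest average.
import numpy as np
import matplotlib.pyplot as plt

iterations = np.arange(500, 30001, 500)       # one checkpoint every 500 iterations
cer = np.random.rand(10, len(iterations))     # placeholder: 10 folds x checkpoints

avg = cer.mean(axis=0)                        # average CER over the folds
best = iterations[avg.argmin()]               # checkpoint with the lowest average CER
print("best checkpoint:", best, "average CER:", avg.min())

plt.plot(iterations, avg, label="average CER of the 10 folds")
plt.axvline(best, linestyle="--", label="best model")
plt.xlabel("number of iterations")
plt.ylabel("character error rate")
plt.legend()
plt.savefig("cer_vs_iterations.png")
```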

The lowest average error occurred at 16,000 iterations, reducing the original CER for the test set (i.e. Ocropy's original results for the test set) by about 32%. In particular, the single model that presented the best results for the test set reduced the CER by 60.2%, compared to Ocropy's original results. Furthermore, this model also presented the best results on the Original Dataset (the one used in the binarization step). I named this model BEST. Its results on the Original Dataset are shown below:

Chart 2: recognition results on the Original Dataset and on its binarized version from section 1, using Ocropy's standard trained model and using the BEST model.

3- Segmentation

I played with all the parameters available in ocropus-gpageseg, in order to see their effect on the segmentation of the old-book pages. Specifically for the threshold parameter, which adjusts the baseline threshold, I obtained a noticeable improvement. Again, the reason is related to the pages that contain pictures: with the new (higher) threshold value, more of the pictures' connected components were discarded, resulting in fewer spurious characters being recognized later on. The threshold value was changed from 0.2 to 0.45, resulting in:

Chart 3: Ocropy's recognition results on the Original Dataset, with threshold=0.2 and with threshold=0.45.
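For reference, the only change with respect to the default segmentation call is the higher baseline threshold; wrapped in Python it looks roughly like this (the book/*.bin.png layout is just an example, and I am assuming the flag is spelled --threshold as in ocropus-gpageseg's options):

```python
# Sketch: run page segmentation with the baseline threshold raised from the
# default 0.2 to 0.45; paths are an example layout, not the actual dataset.
import glob
import subprocess

for page in sorted(glob.glob("book/*.bin.png")):
    subprocess.call(["ocropus-gpageseg", "--threshold", "0.45", page])
```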

4- Final Results

Considering all of the improvements obtained, I was able to reduce the CER from 7.91% to 4.55%, an improvement of around 43%!!!

Chart 4: Ocropy's recognition results on the Original Dataset, considering the improvements obtained throughout the study.

I carried out this research in order to see how far we can use OCR to turn old books (in the public domain) into PDFs and ebooks, and I am really glad that we are already able to get satisfactory results. If the community finds any of these results interesting, I would be glad to implement them and contribute to Ocropy!

Git revision of ocropy: 02007714ffb27e4595ea2133cf0bb351d1ed5856. Notice that it dates back to June 2016; I have been using the same version since then, in order to keep my results consistent.

zuphilip commented 5 years ago

Thank you @PedroBarcha for sharing your study here with us. This is very interesting to read. Do you plan to publish this in a more formal way?

> I made a script that plots the average CER of the 10 folds for each model recorded.

Can you share that script? I think other people have asked about something like this, and therefore it could be helpful for them to see how you did that.