impresso / NZZ-black-letter-ground-truth

Ground truth for Neue Zürcher Zeitung black letter period

Assessing the OCR quality of newspapers and training new OCR models both require a ground truth.

Sampling

The Neue Zürcher Zeitung (NZZ) was published in black letter from its very first issue in 1780 until 1947. From this period, we randomly sampled one front page per year, for a total of 167 pages. We chose front pages because they typically contain highly relevant material and because we wanted to avoid sampling pages that consist exclusively of advertisements or stock information. During certain periods, the NZZ was published several times a day and also had supplements. Because the metadata is incomplete, the sample includes front pages from supplements.
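The sampling step described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build this ground truth: the input shape (a list of `(year, page_id)` pairs), the page identifiers, and the function name `sample_one_per_year` are all assumptions.

```python
import random
from collections import defaultdict


def sample_one_per_year(frontpages, seed=0):
    """Pick one front page uniformly at random for each year.

    `frontpages` is an iterable of (year, page_id) pairs. Both the input
    shape and the page identifiers are assumptions for illustration.
    """
    by_year = defaultdict(list)
    for year, page_id in frontpages:
        by_year[year].append(page_id)

    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return {
        year: rng.choice(sorted(page_ids))
        for year, page_ids in sorted(by_year.items())
    }


# Hypothetical candidates: two front pages in 1780, one in 1781.
candidates = [(1780, "1780-a"), (1780, "1780-b"), (1781, "1781-a")]
sample = sample_one_per_year(candidates)
print(sample)
```

Note that a candidate list built from incomplete metadata cannot tell regular front pages from supplement front pages, which is why supplements can end up in such a sample.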

Ground truth production

To speed up ground truth production, we uploaded the 167 images to Transkribus and ran OCR with its internal ABBYY FineReader Server 11. We then manually corrected the text in Transkribus. For about 100 pages, we corrected at the word level, and Transkribus synchronized the line level automatically. For the remaining pages, only the line level was corrected.

Once 120 pages had been transcribed, the Transkribus team trained an HTR model, which we used to recognize the text of the remaining pages. This sped up our process significantly; however, these XML files do not contain any word-level information.

Guidelines

Some pages are slightly cut off on the right-hand side. This stems from the NZZ's digitisation process.

Please note that for pages which have only been corrected on the line level, the ground truth XML files still contain the uncorrected text on the word level!
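Because of this, it is safest to read transcriptions from the line level only. The following is a minimal sketch, assuming the XML files follow the PAGE XML layout that Transkribus exports (`TextLine` elements with a direct `TextEquiv/Unicode` child, and optional `Word` children carrying their own `TextEquiv`). The helper names and the sample document are made up for illustration, so check them against the actual files.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def line_texts(xml_string):
    """Return the line-level text of every TextLine, in document order."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        # Only inspect direct children of the line, so the (possibly
        # uncorrected) TextEquiv elements nested inside Word are skipped.
        for child in elem:
            if local(child.tag) != "TextEquiv":
                continue
            unicode_el = next(
                (c for c in child if local(c.tag) == "Unicode"), None
            )
            if unicode_el is not None:
                lines.append(unicode_el.text or "")
    return lines


# Made-up sample: the word level is uncorrected, the line level is corrected.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion>
      <TextLine>
        <Word><TextEquiv><Unicode>Zurich</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Zürich</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(line_texts(SAMPLE))
```

Matching on local tag names keeps the sketch independent of the exact PAGE schema version declared in a file.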

Training and test splits used for Transkribus HTR model evaluation

Our DH2019 paper about Transkribus HTR for improving the OCR of black letter in newspaper texts used the following years for testing: 1780, 1790, 1800, 1810, 1820, 1830, 1840, 1850, 1860, 1870, 1880, 1890, 1904, 1910, 1915, 1929, 1939. The repository contains a text file with the exact list of names.
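As a rough illustration, the split can be reproduced by partitioning file names by year. This sketch assumes that each file name contains its four-digit publication year, which may not hold for the actual files. The file names below are hypothetical, and the text file in the repository remains the authoritative list.

```python
import re

# Test years as listed above.
TEST_YEARS = {
    1780, 1790, 1800, 1810, 1820, 1830, 1840, 1850, 1860,
    1870, 1880, 1890, 1904, 1910, 1915, 1929, 1939,
}


def split_by_year(filenames):
    """Partition file names into (train, test) by the year in the name."""
    train, test = [], []
    for name in filenames:
        # Assumption: the first 17xx/18xx/19xx number is the publication year.
        match = re.search(r"1[789]\d\d", name)
        if match is None:
            continue  # no recognisable year: leave the file out of both sets
        year = int(match.group(0))
        (test if year in TEST_YEARS else train).append(name)
    return train, test


# Hypothetical file names for illustration only.
files = ["page-1780.xml", "page-1781.xml", "page-1904.xml", "notes.txt"]
train, test = split_by_year(files)
print(train, test)
```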

Content

This NZZ ground truth contains several directories:

Transcribers:

Final remarks

All the data in this repository, which includes .xml and image files, is licensed under a Creative Commons license as specified in the file LICENSE.txt. This ground truth can be used for academic purposes.

Neue Zürcher Zeitung black letter ground truth (c) by Phillip Ströbel and Simon Clematide

Neue Zürcher Zeitung black letter ground truth is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

You should have received a copy of the license along with this work. If not, see https://creativecommons.org/licenses/by-nc/4.0/legalcode.txt.

If you use it, please cite it as:

@inproceedings{clematide-stroebel-2019,
  author = "Ströbel, Phillip and Clematide, Simon",
  title = "Improving OCR of Black Letter in Historical Newspapers: The Unreasonable Effectiveness of HTR Models on Low-Resolution Images",
  year = 2019,
  booktitle = "Proceedings of the Digital Humanities 2019, (DH2019)",
  note = "accepted"
}