suub / ocr-engine-results

repository to store the different processed versions and statistics
MIT License

Line-coordinates for Grenzboten ground truth? #1

Closed jbaiter closed 8 years ago

jbaiter commented 8 years ago

Is the information about the position of the ground truth lines in the corresponding facsimile image available somewhere?

I would love to use the dataset for training/evaluating Fraktur OCR, but without the coordinates I would have to resort to hacks: either doing the line detection myself and aligning it with the ground truth, or reverse-engineering the positions from the text layer in the PDFs that are available from the catalogue.

mschuene commented 8 years ago

Detailed information about the ABBYY OCR results of the scans, which we have only post-corrected with our approach, is in the ABBYY XML files, which can be downloaded from brema.suub.uni-bremen.de/grenzboten

You can crawl through the METS files starting from http://brema.suub.uni-bremen.de/grenzboten/oai/?verb=GetRecord&metadataPrefix=mets&identifier=282153 or get the ABBYY XML files directly by vlid: use http://brema.suub.uni-bremen.de/grenzboten/download/fulltext/fr/ followed by the vlid, which you can get from the filenames of the .txt files in this repository.

We used both approaches in https://github.com/suub/laser-experiments/blob/master/src/laser_experiments/core.clj.
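A rough Python sketch of the direct-download route, for illustration only (it is not taken from the linked Clojure code). It rests on two assumptions that should be checked against the real data: that the stem of a .txt filename is the vlid itself, and that the downloaded files follow the usual ABBYY FineReader XML layout, where each `line` element carries `l`, `t`, `r`, `b` pixel attributes and its characters sit in `charParams` children. The helper names `abbyy_url_for` and `line_boxes` are made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Download prefix quoted in the comment above.
BASE_URL = "http://brema.suub.uni-bremen.de/grenzboten/download/fulltext/fr/"


def abbyy_url_for(txt_filename):
    """Build the ABBYY XML download URL for one ground-truth .txt file.

    Assumption: the filename stem is the vlid (e.g. "123456.txt" -> "123456").
    """
    vlid = Path(txt_filename).stem
    return BASE_URL + vlid


def _local_name(tag):
    # Drop the "{namespace}" prefix so the code does not depend on the
    # exact ABBYY schema version used in the files.
    return tag.rsplit("}", 1)[-1]


def line_boxes(abbyy_xml):
    """Return one dict per <line> element with its bounding box and text.

    Assumption: <line> elements have integer l/t/r/b attributes and the
    recognized characters are the text of nested <charParams> elements.
    """
    root = ET.fromstring(abbyy_xml)
    boxes = []
    for elem in root.iter():
        if _local_name(elem.tag) != "line":
            continue
        text = "".join(
            child.text or ""
            for child in elem.iter()
            if _local_name(child.tag) == "charParams"
        )
        boxes.append(
            {
                "left": int(elem.get("l")),
                "top": int(elem.get("t")),
                "right": int(elem.get("r")),
                "bottom": int(elem.get("b")),
                "text": text,
            }
        )
    return boxes
```

The XML for a page could then be fetched with something like `urllib.request.urlopen(abbyy_url_for("123456.txt")).read()` (the filename is a placeholder) and passed to `line_boxes`. The text in these boxes is the uncorrected ABBYY output, so it would still need to be matched against the corrected ground-truth lines.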

jbaiter commented 8 years ago

Thank you for the incredibly quick answer! :-)