Closed jbaiter closed 8 years ago
Detailed information about the original ABBYY OCR results, which we only post-corrected with our approach, is in the ABBYY XML files. These can be downloaded from brema.suub.uni-bremen.de/grenzboten.
You can crawl through the METS files starting from http://brema.suub.uni-bremen.de/grenzboten/oai/?verb=GetRecord&metadataPrefix=mets&identifier=282153, or get the ABBYY XML files directly by VLID from http://brema.suub.uni-bremen.de/grenzboten/download/fulltext/fr/ followed by the VLID, which you can take from the filenames of the .txt files in this repository.
We used both approaches in https://github.com/suub/laser-experiments/blob/master/src/laser_experiments/core.clj.
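For anyone who wants to try the direct-download route without reading the Clojure code, here is a rough Python sketch. It is not what laser-experiments does; it only builds the fulltext URL from a VLID and pulls line bounding boxes out of an ABBYY XML string. The helper names (`vlid_from_filename`, `abbyy_url`, `line_boxes`) are made up for illustration, and two things are assumptions: that the first run of digits in a .txt filename is the VLID, and that the files follow the usual ABBYY FineReader layout where each `line` element carries `l`, `t`, `r`, `b` attributes.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Base URL quoted in the comment above; the VLID is appended to it.
FULLTEXT_BASE = "http://brema.suub.uni-bremen.de/grenzboten/download/fulltext/fr/"


def vlid_from_filename(name):
    """Return the VLID for a .txt file name.

    Assumption: the first run of digits in the file stem is the VLID.
    """
    match = re.search(r"\d+", Path(name).stem)
    if match is None:
        raise ValueError(f"no VLID found in {name!r}")
    return match.group(0)


def abbyy_url(vlid):
    """Build the direct download URL for the ABBYY XML of one page."""
    return FULLTEXT_BASE + str(vlid)


def line_boxes(xml_text):
    """Collect (left, top, right, bottom) for every <line> element.

    Assumption: lines carry l/t/r/b attributes, as in ABBYY FineReader XML.
    The namespace is stripped so the schema version does not matter.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "line" and all(key in elem.attrib for key in "ltrb"):
            boxes.append(tuple(int(elem.attrib[key]) for key in "ltrb"))
    return boxes
```

To fetch a real file you could pass `abbyy_url(vlid)` to `urllib.request.urlopen` and feed the decoded response to `line_boxes`. I have not checked how the server encodes or compresses the response, so that part may need adjusting.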
Thank you for the incredibly quick answer! :-)
Is the information about the position of the ground truth lines in the corresponding facsimile image available somewhere?
I would love to use the dataset for training/evaluating Fraktur OCR, but without the coordinates I would have to resort to hacks like doing the line detection myself and aligning it with the ground truth, or reverse-engineering it from the text layer in the PDFs that are available from the catalogue.