impresso / NZZ-black-letter-ground-truth


Fix transcriptions in evaluation set #5

Closed stweil closed 2 years ago

stweil commented 4 years ago

Signed-off-by: Stefan Weil sw@weilnetz.de

stweil commented 4 years ago

These are some examples where Tesseract OCR disagrees with the GT data, and Tesseract is right.

stweil commented 4 years ago

Only slightly related to this issue:

@pstroe, I noticed today that my line extractor produced images with a wrong vertical offset for some of the Transkribus PAGE XML files (see https://github.com/OCR-D/format-converters/issues/16). This affects the evaluation images, and I expect it affects the training images, too. Fixing that should give even better Tesseract OCR results. How did you produce the line images for OCR?
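A minimal sketch of how line bounding boxes can be read from the `Coords` elements of a PAGE XML file, which is where a vertical-offset bug in an extractor would show up. The helper name `line_bboxes` and the sample fragment are hypothetical; only the namespace URI and the `points` attribute format follow the PAGE schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by PAGE XML exports (2013-07-15 schema version)
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# Hypothetical minimal PAGE fragment with a single text line polygon
SAMPLE = f"""<?xml version="1.0"?>
<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="10,20 110,20 110,60 10,60"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

def line_bboxes(xml_text):
    """Return {line_id: (x0, y0, x1, y1)} from TextLine Coords polygons."""
    root = ET.fromstring(xml_text)
    ns = {"p": PAGE_NS}
    boxes = {}
    for line in root.iterfind(".//p:TextLine", ns):
        pts = [tuple(map(int, p.split(",")))
               for p in line.find("p:Coords", ns).get("points").split()]
        xs, ys = zip(*pts)
        boxes[line.get("id")] = (min(xs), min(ys), max(xs), max(ys))
    return boxes
```

Comparing the `y0`/`y1` values computed this way against the crop coordinates an extractor actually used is one way to spot a systematic vertical offset.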

stweil commented 4 years ago

Meanwhile I have fixed around 150 words in the evaluation set, and I estimate that maybe twice that number were wrong in the initial evaluation set. For the training data set I expect a similar error rate in the GT texts. It would be interesting to see how that affects the trained models and the accuracy numbers.
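For context on how such GT fixes feed into accuracy numbers: OCR evaluations typically report character error rate, i.e. edit distance between GT and OCR output divided by GT length, so errors in the GT itself directly distort the reported rate. A minimal self-contained sketch (the function name `cer` is mine, not from this repo's tooling):

```python
def cer(gt, ocr):
    """Character error rate: Levenshtein distance / length of ground truth."""
    m, n = len(gt), len(ocr)
    prev = list(range(n + 1))          # DP row for the empty-prefix baseline
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                       # deletion
                         cur[j - 1] + 1,                    # insertion
                         prev[j - 1] + (gt[i - 1] != ocr[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, 1)
```

If a "wrong" GT word is actually the OCR engine being right, fixing the GT lowers the measured error rate without any change to the model.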

stweil commented 4 years ago

The PAGE XML generated by Transkribus is not only invalid XML but also contains strange word and line boxes. See https://github.com/Transkribus/TranskribusCore/issues/46 for details.
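A quick way to screen exported files for the first kind of problem is a plain well-formedness check with the standard library (full schema validation against the PAGE XSD would need an external validator; this sketch only catches malformed XML):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Broken word and line boxes would pass this check, of course; those need geometric sanity checks (e.g. non-negative, in-page coordinates) on top.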