These are some examples where Tesseract OCR disagrees with the GT data, and Tesseract is right.
Only slightly related to this issue:
@pstroe, I noticed today that my line extractors extracted images with a wrong vertical offset for some of the Transkribus PAGE XML files (see https://github.com/OCR-D/format-converters/issues/16). This affects evaluation images, and I expect that it affects training images, too. Fixing that should give even better Tesseract OCR results. How did you produce line images for OCR?
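For reference, a minimal sketch of how a line box can be derived from PAGE XML, assuming the `points="x,y x,y ..."` form of `Coords`. This is not the extractor discussed above: the function names and the sample document are hypothetical, and the `{*}` namespace wildcard needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def parse_points(points):
    """Parse a PAGE 'points' string such as '10,20 110,20' into (x, y) tuples."""
    return [tuple(int(float(v)) for v in pair.split(",")) for pair in points.split()]


def line_boxes(page_xml):
    """Return {line id: (left, top, right, bottom)} for every TextLine."""
    root = ET.fromstring(page_xml)
    boxes = {}
    # '{*}' matches any namespace, so the PAGE schema version does not matter here.
    for line in root.findall(".//{*}TextLine"):
        # Direct child only, so the Coords of nested Word elements are ignored.
        coords = line.find("{*}Coords")
        if coords is None or not coords.get("points"):
            continue
        xs, ys = zip(*parse_points(coords.get("points")))
        boxes[line.get("id")] = (min(xs), min(ys), max(xs), max(ys))
    return boxes


# Hypothetical sample, not taken from the data set discussed in this issue.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="100,200 900,205 900,260 100,255"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(line_boxes(SAMPLE))  # {'l1': (100, 200, 900, 260)}
```

The resulting tuple has the (left, upper, right, lower) order that Pillow's `Image.crop` expects. Comparing the `top` value with where the text actually sits in the page image should make a vertical offset visible.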
Meanwhile I have fixed around 150 words in the evaluation set, and I estimate that about twice that number were wrong in the initial evaluation set. I expect a similar error rate in the GT texts of the training data set. It would be interesting to see how that affects the trained models and the accuracy numbers.
The PAGE XML which was generated by Transkribus is not only invalid XML but also contains strange word and line boxes. See https://github.com/Transkribus/TranskribusCore/issues/46 for details.
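One way to look for such boxes is sketched below. It rests on an assumption that is mine, not taken from the linked issue: a word box is treated as suspicious when it extends beyond the box of its parent line. The function names and the sample are hypothetical. Note also that `ET.fromstring` only checks that the XML is well-formed (it raises `ET.ParseError` otherwise) and does not validate against the PAGE schema.

```python
import xml.etree.ElementTree as ET


def bbox(element):
    """Bounding box (left, top, right, bottom) of an element's own Coords, or None."""
    coords = element.find("{*}Coords")  # '{*}' wildcard needs Python 3.8+
    if coords is None or not coords.get("points"):
        return None
    pairs = [pair.split(",") for pair in coords.get("points").split()]
    xs = [int(float(x)) for x, _ in pairs]
    ys = [int(float(y)) for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)


def suspicious_words(page_xml, tolerance=0):
    """List (line id, word id) for words that stick out of their line box.

    'tolerance' is a hypothetical slack in pixels; the criterion itself is an
    assumption about what a strange box looks like.
    """
    root = ET.fromstring(page_xml)  # raises ET.ParseError if not well-formed
    flagged = []
    for line in root.findall(".//{*}TextLine"):
        lb = bbox(line)
        if lb is None:
            continue
        for word in line.findall("{*}Word"):
            wb = bbox(word)
            if wb is None:
                continue
            outside = (wb[0] < lb[0] - tolerance or wb[1] < lb[1] - tolerance
                       or wb[2] > lb[2] + tolerance or wb[3] > lb[3] + tolerance)
            if outside:
                flagged.append((line.get("id"), word.get("id")))
    return flagged


# Hypothetical sample: word w2 starts 50 pixels above its line.
SAMPLE_WORDS = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,260 100,260"/>
        <Word id="w1"><Coords points="110,205 300,205 300,255 110,255"/></Word>
        <Word id="w2"><Coords points="320,150 500,150 500,255 320,255"/></Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(suspicious_words(SAMPLE_WORDS))  # [('l1', 'w2')]
```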
Signed-off-by: Stefan Weil sw@weilnetz.de