Sample training dataset

ocr-d-modul-2-segmentierung / page-segmentation

Pixel classifier for historic prints. Mirror of https://gitlab2.informatik.uni-wuerzburg.de/ls6/ocr4all-pixel-classifier. Report issues here.


Hi, you can find some publicly available data in the OCR-D repositories. These contain color images and PAGE XML. You'll have to generate binarized images yourself and then follow the dataset generation examples to produce the JSON files.
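In case it helps to see what the binarization step does, here is a minimal sketch using Otsu's global threshold in plain Python. Otsu is only one common choice and is an assumption here; the classifier does not prescribe a binarization method, and for real scans you would normally use an existing binarization tool. The function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    weight_low = 0   # number of pixels with value <= t
    sum_low = 0      # sum of gray values of those pixels
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels in the lower class yet
        if weight_high == 0:
            break     # all pixels are in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Proportional to the between-class variance
        score = weight_low * weight_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels):
    """Map each pixel to 0 (ink) or 255 (background) using Otsu's threshold."""
    t = otsu_threshold(pixels)
    return [255 if value > t else 0 for value in pixels]
```

A global threshold like this tends to struggle with uneven illumination or stained paper, which are common in historic prints, so an adaptive method may give better results.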

The example assumes three subfolders: binary for the binarized images, jpg for the color images, and page for the XML. In the OCR-D zip files, the color images are in OCR-D-IMG, and you should use the PAGE XML from OCR-D-GT-SEG-BLOCK, so change the paths in the example accordingly. As mentioned above, you'll have to generate the binarized images yourself; put them in a separate folder.
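Before running dataset generation, it can be useful to check that every page has all three files. The sketch below pairs files across the three folders. It assumes that the binarized image, color image and PAGE XML of a page share the same file name without extension; if your files carry folder-specific prefixes, the matching key needs adapting. The function name `collect_triples` and the suffix list are hypothetical, not part of the classifier.

```python
from pathlib import Path

# Image extensions to consider (an assumption; extend as needed)
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def collect_triples(binary_dir, image_dir, page_dir):
    """Pair binarized image, color image and PAGE XML by file stem.

    Returns (triples, missing): a list of (binary, image, xml) paths for
    pages present in all three folders, and the sorted stems of pages
    that are missing in at least one folder.
    """
    def by_stem(folder, suffixes):
        return {
            p.stem: p
            for p in sorted(Path(folder).iterdir())
            if p.suffix.lower() in suffixes
        }

    binaries = by_stem(binary_dir, IMAGE_SUFFIXES)
    images = by_stem(image_dir, IMAGE_SUFFIXES)
    pages = by_stem(page_dir, {".xml"})

    complete = sorted(set(binaries) & set(images) & set(pages))
    seen = set(binaries) | set(images) | set(pages)
    missing = sorted(seen - set(complete))

    triples = [(binaries[s], images[s], pages[s]) for s in complete]
    return triples, missing
```

For an unpacked OCR-D zip, a call could look like `collect_triples("binary", "OCR-D-IMG", "OCR-D-GT-SEG-BLOCK")`, where `binary` is the separate folder holding your own binarized images.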
