ocr-d-modul-2-segmentierung / page-segmentation

Pixel classifier for historic prints. Mirror of https://gitlab2.informatik.uni-wuerzburg.de/ls6/ocr4all-pixel-classifier. Report issues here.

Sample training dataset #1

Closed ghost closed 4 years ago

ghost commented 4 years ago

Thank you for your hard work

It would be great if you could upload a sample training dataset and JSON file, so that I can understand the structure of the dataset required for training, even if it's just a few images.

crater2150 commented 4 years ago

Hi, you can find some publicly available data in the OCR-D repositories. These contain color images and PAGEXML. You'll have to generate binarized images yourself and then follow the dataset generation examples to produce the JSON files.

The example assumes subfolders named binary, jpg and page for the binarized images, color images and XML files, respectively. In the OCR-D zip files, the color images are in the OCR-D-IMG folder, and you should use the PAGEXML from OCR-D-GT-SEG-BLOCK, so change the paths in the example accordingly. As I said, you'll have to generate the binarized images yourself; put them in a separate folder.
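
As an alternative to editing the paths, the folder layout described above could be built by copying the files. The following is only a rough sketch using the Python standard library, not part of the pixel classifier: the function name `build_layout` and both directory arguments are hypothetical, and it assumes the unpacked OCR-D zip contains OCR-D-IMG and OCR-D-GT-SEG-BLOCK as direct subfolders. It does not binarize anything; it only creates an empty binary folder for you to fill.

```python
import shutil
from pathlib import Path


def build_layout(ocrd_dir, out_dir):
    """Copy OCR-D files into the binary/jpg/page layout (hypothetical helper).

    ocrd_dir: unpacked OCR-D zip, assumed to contain OCR-D-IMG and
              OCR-D-GT-SEG-BLOCK as direct subfolders.
    out_dir:  target folder; jpg/, page/ and an empty binary/ are created.
    Returns a dict mapping each target subfolder to the copied file names.
    """
    ocrd_dir, out_dir = Path(ocrd_dir), Path(out_dir)

    # Source folder in the OCR-D zip -> subfolder name used by the example.
    targets = {"OCR-D-IMG": "jpg", "OCR-D-GT-SEG-BLOCK": "page"}

    copied = {}
    for src_name, dst_name in targets.items():
        dst = out_dir / dst_name
        dst.mkdir(parents=True, exist_ok=True)
        files = sorted(p for p in (ocrd_dir / src_name).iterdir() if p.is_file())
        for f in files:
            shutil.copy2(f, dst / f.name)  # file names are kept unchanged
        copied[dst_name] = [f.name for f in files]

    # Binarized images are not shipped with the data; this folder stays
    # empty until you generate them with a binarization tool of your choice.
    (out_dir / "binary").mkdir(parents=True, exist_ok=True)
    return copied
```

Calling `build_layout("path/to/unpacked-zip", "dataset")` with your own paths would leave you with dataset/jpg, dataset/page and an empty dataset/binary. The file names in the three folders may still need to be matched up for the dataset generation step, which this sketch does not attempt.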