Hi, you can find some publicly available data in the OCR-D repositories. These contain color images and PAGEXML. You'll have to generate binarized images yourself and then follow the dataset generation examples to produce the JSON files.
The example assumes subfolders of `binary`, `jpg` and `page` for binarization, color image and xml. In the OCR-D zip files, the folder for color image is `OCR-D-IMG`, and you should use the PAGEXML from `OCR-D-GT-SEG-BLOCK`, so change the paths in the example accordingly (as I said, you'll have to generate binaries yourself and put them in a separate folder).
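If it helps, here is a rough sketch of how the folder layout described above could be put together from an unpacked OCR-D zip. It is not part of the project: `build_layout` and `missing_binaries` are hypothetical helper names, and it assumes that an image and its PAGEXML file share the same file stem, which may not hold for every OCR-D archive (adjust the matching if the names differ). Binarization itself is not done here; the `binary` folder is only created so you can fill it with the tool of your choice.

```python
import shutil
from pathlib import Path


def build_layout(ocrd_root, out_root,
                 img_group="OCR-D-IMG", xml_group="OCR-D-GT-SEG-BLOCK"):
    """Copy color images into jpg/ and PAGEXML into page/ under out_root.

    Assumption: an image and its XML share the same file stem.
    Returns the sorted stems that had both an image and an XML file.
    """
    ocrd_root, out_root = Path(ocrd_root), Path(out_root)
    images = {p.stem: p for p in (ocrd_root / img_group).iterdir() if p.is_file()}
    xmls = {p.stem: p for p in (ocrd_root / xml_group).glob("*.xml")}

    # Create the three subfolders the example expects; binary/ stays empty.
    for name in ("binary", "jpg", "page"):
        (out_root / name).mkdir(parents=True, exist_ok=True)

    paired = sorted(images.keys() & xmls.keys())
    for stem in paired:
        shutil.copy2(images[stem], out_root / "jpg" / images[stem].name)
        shutil.copy2(xmls[stem], out_root / "page" / xmls[stem].name)
    return paired


def missing_binaries(out_root):
    """List stems in jpg/ that have no counterpart in binary/ yet."""
    out_root = Path(out_root)
    done = {p.stem for p in (out_root / "binary").iterdir() if p.is_file()}
    return sorted(
        p.stem
        for p in (out_root / "jpg").iterdir()
        if p.is_file() and p.stem not in done
    )
```

After running `build_layout("path/to/unzipped", "data")`, calling `missing_binaries("data")` shows which pages still need a binarized image before you point the dataset generation example at the three folders.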
Thank you for your hard work.
It would be great if you could upload a sample training dataset and JSON file, even if it's just a few images, so that I can understand the structure of the dataset required for training.