How to create training dataset programmatically?

mittagessen / kraken

OCR engine for all the languages

http://kraken.re

Apache License 2.0

How to create training dataset programmatically? #562

Closed hhuseyinpay closed 9 months ago

hhuseyinpay commented 10 months ago

Hello, I want to create an Ottoman language model from scratch. I currently have access to a corpus of over 6000 pages of text, presented in 4 different fonts. Can I programmatically generate the XML files (in either ALTO or PAGE format) that correspond to PNG images of the text, which I also generate programmatically? Any guidance, advice, or resources on this topic would be very helpful.

Thank you.

mittagessen commented 10 months ago

Yes, you can create these programmatically, but kraken currently doesn't include tooling for it. You'd basically have to place the lines on a page and determine their baselines. If you use something like Pango or Pillow for typesetting, you can get the baselines from their helper functions. Afterwards, I'd run everything through the polygonizer (kraken.lib.segmentation.calculate_polygonal_environment) and then serialize the result into PAGE/ALTO files.

Upgrading the old ketos linegen tool to page-wise rendering is on my to-do list, but it is fairly low priority.
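
The "put the lines on a page and determine their baselines" step could look roughly like the sketch below, assuming Pillow is used for typesetting. The names `PlacedLine`, `layout_baselines`, and `render_page`, as well as the margin, leading, font size, and page size, are made up for illustration and are not part of kraken. The layout arithmetic is kept separate from the rendering so it can be checked without a font file. `font_path` is whatever TrueType/OpenType font you supply. For Arabic-script text, Pillow generally needs to be built with libraqm to shape the glyphs correctly.

```python
from dataclasses import dataclass


@dataclass
class PlacedLine:
    text: str
    baseline: list  # [(x0, y), (x1, y)] in pixel coordinates


def layout_baselines(texts, widths, ascent, descent, margin=50, leading=1.5):
    """Stack lines top to bottom and return each line with its baseline."""
    line_height = ascent + descent
    placed = []
    y = margin + ascent  # baseline of the first line
    for text, width in zip(texts, widths):
        placed.append(PlacedLine(text, [(margin, y), (margin + round(width), y)]))
        y += round(line_height * leading)  # move down to the next baseline
    return placed


def render_page(texts, font_path, out_png, font_size=40, page_size=(1200, 1600)):
    """Render the lines with Pillow and return the baselines that were used."""
    # Imported lazily so the layout helper above works without Pillow installed.
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype(font_path, font_size)
    ascent, descent = font.getmetrics()
    widths = [font.getlength(t) for t in texts]
    placed = layout_baselines(texts, widths, ascent, descent)

    page = Image.new("L", page_size, 255)  # white greyscale page
    draw = ImageDraw.Draw(page)
    for line in placed:
        # anchor="ls" places the left end of the text baseline at the given point,
        # so the drawn text sits exactly on the recorded baseline.
        draw.text(line.baseline[0], line.text, font=font, fill=0, anchor="ls")
    page.save(out_png)
    return placed
```

The returned baselines, together with the saved image, are the inputs you would then hand to the polygonizer and the PAGE/ALTO serialization step. This sketch does not check that the lines fit on the page.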

tarrinw commented 9 months ago

Can I just ask a follow-up: Is the only thing required the baselines and the transcriptions linked to them? Will any other information about page segmentation (e.g. line and/or token outlines) be discarded as unnecessary?

I have been able to programmatically produce PAGE format files based on a TEI-encoded corpus and initial segmentation by Kraken, but haven't started training yet.

mittagessen commented 9 months ago

On 24/01/30 04:48AM, Tarrin Wills wrote:

> Can I just ask a follow-up: Is the only thing required the baselines and the transcriptions linked to them? Will any other information about page segmentation (e.g. line and/or token outlines) be discarded as unnecessary?

Anything below line level isn't used for training, but you also need the line-wise bounding polygons (<Coords points ... under <TextLine>). They aren't trained, so you can just have kraken calculate them. There's a script, repolygonize.py, in contrib/ that does exactly that, but the polygon XML elements must already exist in the source document (you can just fill them with dummy values).

I hope that helps.
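
As an illustration of the minimum described above (a baseline, a transcription, and a placeholder polygon per line), here is a rough sketch that writes a PAGE-style file with the Python standard library. The function name `build_page_xml`, the `above`/`below` padding, and the element ids are assumptions made for this example. The output has not been validated against the PAGE schema (for instance, no Metadata element is written), so treat it as a starting point rather than a reference serializer.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def _q(tag):
    """Qualify a tag name with the PAGE namespace."""
    return f"{{{PAGE_NS}}}{tag}"


def _points(pts):
    """Format points as 'x1,y1 x2,y2 ...', clamping negatives to 0."""
    return " ".join(f"{max(0, int(x))},{max(0, int(y))}" for x, y in pts)


def build_page_xml(image_name, width, height, lines, above=30, below=10):
    """lines: iterable of (text, [(x0, y0), (x1, y1)]) baseline pairs."""
    ET.register_namespace("", PAGE_NS)
    root = ET.Element(_q("PcGts"))
    page = ET.SubElement(root, _q("Page"), imageFilename=image_name,
                         imageWidth=str(width), imageHeight=str(height))
    region = ET.SubElement(page, _q("TextRegion"), id="r0")
    # One region covering the whole page.
    ET.SubElement(region, _q("Coords"),
                  points=_points([(0, 0), (width, 0), (width, height), (0, height)]))
    for i, (text, baseline) in enumerate(lines):
        (x0, y0), (x1, y1) = baseline
        line = ET.SubElement(region, _q("TextLine"), id=f"l{i}")
        # Placeholder polygon: a box around the baseline, meant to be
        # recomputed later (e.g. by kraken's polygonizer).
        box = [(x0, y0 - above), (x1, y1 - above), (x1, y1 + below), (x0, y0 + below)]
        ET.SubElement(line, _q("Coords"), points=_points(box))
        ET.SubElement(line, _q("Baseline"), points=_points(baseline))
        equiv = ET.SubElement(line, _q("TextEquiv"))
        ET.SubElement(equiv, _q("Unicode")).text = text
    return ET.ElementTree(root)
```

The tree can be saved with `tree.write("page_0001.xml", encoding="utf-8", xml_declaration=True)`, where the file name is just an example. The placeholder boxes are only there so that the polygon elements exist and can be replaced afterwards.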

tarrinw commented 9 months ago

That helps heaps, thanks.