Closed by hhuseyinpay 9 months ago
Yes, you can create these programmatically, but kraken currently doesn't include tooling for it. You'd basically have to put the lines on a page and determine their baselines. If you use something like Pango or Pillow for typesetting, you can get the baselines from some helper functions. Afterwards, I'd run everything through the polygonizer (`kraken.lib.segmentation.calculate_polygonal_environment`) and then serialize into PAGE/ALTO files.
It is on my to-do list to upgrade the old `ketos linegen` tool for page-wise rendering, but it is fairly low priority.
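The first two steps described above (put the lines on a page, determine their baselines) can be sketched roughly as follows. This is only an illustration and not kraken tooling: `layout_lines`, `render_page`, `PlacedLine`, and the margin and leading values are hypothetical names and defaults, and the font path is whatever font you have locally. It assumes Pillow's default text anchor, where the point passed to `draw.text()` is the left edge at the ascender line, so the baseline lies `ascent` pixels below it. For Arabic-script text such as Ottoman, Pillow needs to be built with libraqm for correct shaping.

```python
from dataclasses import dataclass


@dataclass
class PlacedLine:
    text: str
    origin: tuple    # (x, y) point passed to draw.text()
    baseline: list   # two (x, y) points, left and right end of the baseline


def layout_lines(texts, widths, ascent, descent, margin=50, leading=10):
    """Stack lines top to bottom and derive each baseline from font metrics."""
    placed = []
    y = margin
    for text, width in zip(texts, widths):
        base_y = y + ascent  # baseline sits `ascent` pixels below the line top
        baseline = [(margin, base_y), (margin + round(width), base_y)]
        placed.append(PlacedLine(text, (margin, y), baseline))
        y += ascent + descent + leading
    return placed


def render_page(texts, font_path, font_size=32, page_size=(1200, 1600)):
    """Render the lines with Pillow; returns the image and the placed lines."""
    from PIL import Image, ImageDraw, ImageFont  # third-party: pip install pillow

    font = ImageFont.truetype(font_path, font_size)
    ascent, descent = font.getmetrics()
    widths = [font.getlength(t) for t in texts]
    lines = layout_lines(texts, widths, ascent, descent)

    page = Image.new("L", page_size, color=255)
    draw = ImageDraw.Draw(page)
    for line in lines:
        draw.text(line.origin, line.text, font=font, fill=0)
    return page, lines
```

The baselines returned here, together with the rendered image, are what you would then hand to the polygonizer and the PAGE/ALTO serialization.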
Can I just ask a follow-up: Is the only thing required the baselines and the transcriptions linked to them? Will any other information about page segmentation (e.g. line and/or token outlines) be discarded as unnecessary?
I have been able to programmatically produce PAGE format files based on a TEI-encoded corpus and initial segmentation by Kraken, but haven't started training yet.
On 24/01/30 04:48AM, Tarrin Wills wrote:
> Can I just ask a follow-up: Is the only thing required the baselines and the transcriptions linked to them? Will any other information about page segmentation (e.g. line and/or token outlines) be discarded as unnecessary?
Anything below line level isn't used for training, but you also need the line-wise bounding polygons (`<Coords points ...` under `<TextLine>`).
They aren't trained, so you can just have kraken calculate them. There's a script, `repolygonize.py`, in `contrib/` that does exactly that, but you need to have the actual polygon XML elements already in the source doc (you can just fill them with dummy values).
I hope that helps.
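For reference, a minimal sketch of what such a source document could look like when generated programmatically: one `<TextLine>` per line with a real `<Baseline>`, the transcription, and a placeholder `<Coords>` polygon to be recomputed later. The function names, the `ascent`/`descent` defaults, and the single full-page region are assumptions made for illustration. The namespace is the 2019-07-15 PAGE schema; the `<Metadata>` element is omitted for brevity, so the output may need it added for strict schema validation.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def points(coords):
    """Format (x, y) pairs the way PAGE expects: 'x1,y1 x2,y2 ...'."""
    return " ".join(f"{x},{y}" for x, y in coords)


def build_page_xml(image_name, width, height, lines, ascent=30, descent=10):
    """lines: iterable of (text, baseline), baseline being a list of (x, y)."""
    ET.register_namespace("", PAGE_NS)

    def q(tag):
        return f"{{{PAGE_NS}}}{tag}"

    root = ET.Element(q("PcGts"))
    page = ET.SubElement(root, q("Page"), imageFilename=image_name,
                         imageWidth=str(width), imageHeight=str(height))
    region = ET.SubElement(page, q("TextRegion"), id="r0")
    full = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    ET.SubElement(region, q("Coords"), points=points(full))

    for i, (text, baseline) in enumerate(lines):
        (x0, y0), (x1, y1) = baseline[0], baseline[-1]
        line = ET.SubElement(region, q("TextLine"), id=f"l{i}")
        # Dummy polygon: a box around the baseline, meant to be replaced
        # by a properly computed bounding polygon afterwards.
        box = [(x0, y0 - ascent), (x1, y1 - ascent),
               (x1, y1 + descent), (x0, y0 + descent)]
        ET.SubElement(line, q("Coords"), points=points(box))
        ET.SubElement(line, q("Baseline"), points=points(baseline))
        equiv = ET.SubElement(line, q("TextEquiv"))
        ET.SubElement(equiv, q("Unicode")).text = text

    return ET.tostring(root, encoding="unicode")
```

A call such as `build_page_xml("page.png", 1200, 1600, [("abc", [(50, 80), (150, 80)])])` returns the XML as a string, which can be written next to the image file.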
That helps heaps, thanks.
Hello, I want to create an Ottoman language model from scratch. I currently have access to a corpus of over 6000 pages of text, set in 4 different fonts. Can I programmatically generate XML files (in either ALTO or PAGE format) corresponding to PNG images of the text (which I also generate programmatically)? Any guidance, advice, or resources on this topic would be very helpful.
Thank you.