serialization / templates: prefer re-use of segment identifiers, if existing (from XML parser)

mittagessen / kraken

OCR engine for all the languages

http://kraken.re

Apache License 2.0


serialization / templates: prefer re-use of segment identifiers, if existing (from XML parser) #568

Closed: bertsky closed this issue 3 months ago

bertsky commented 5 months ago

When parsing ALTO or PAGE, you already keep the identifiers of regions and lines. But the output throws this information away and generates vanilla block/line labels. It would be really useful if the default behaviour were idempotent with respect to segment identifiers, so that, for example, input and output, or GT and prediction, can be easily compared.
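To illustrate the requested behaviour, here is a minimal sketch using only the Python standard library. It does not use kraken's parser or serializer; `parse_segments`, `serialize_segments`, and the sample document are hypothetical names made up for this example. The idea is simply that identifiers found in the input are reused on output, and a fresh one is generated only when a segment has none.

```python
import uuid
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"

# Hypothetical sample input: one region with an ID, one line with and one without.
SAMPLE = f"""<alto xmlns="{ALTO_NS}">
  <Layout><Page><PrintSpace>
    <TextBlock ID="region_a">
      <TextLine ID="line_a1"/>
      <TextLine/>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def q(tag):
    """Return the namespace-qualified ElementTree name for an ALTO tag."""
    return f"{{{ALTO_NS}}}{tag}"


def parse_segments(xml_text):
    """Return [(block_id, [line_id, ...]), ...]; a missing ID is None."""
    root = ET.fromstring(xml_text)
    blocks = []
    for block in root.iter(q("TextBlock")):
        line_ids = [line.get("ID") for line in block.iter(q("TextLine"))]
        blocks.append((block.get("ID"), line_ids))
    return blocks


def new_id():
    """Generate an identifier; the underscore keeps it a valid XML ID."""
    return f"_{uuid.uuid4()}"


def serialize_segments(blocks):
    """Write segments out, reusing existing IDs and generating only missing ones."""
    ET.register_namespace("", ALTO_NS)
    root = ET.Element(q("alto"))
    layout = ET.SubElement(root, q("Layout"))
    page = ET.SubElement(layout, q("Page"))
    space = ET.SubElement(page, q("PrintSpace"))
    for block_id, line_ids in blocks:
        block = ET.SubElement(space, q("TextBlock"), ID=block_id or new_id())
        for line_id in line_ids:
            ET.SubElement(block, q("TextLine"), ID=line_id or new_id())
    return ET.tostring(root, encoding="unicode")


first = parse_segments(SAMPLE)
second = parse_segments(serialize_segments(first))
third = parse_segments(serialize_segments(second))
```

In this sketch, `second` keeps `region_a` and `line_a1` from the input, and `third` equals `second`: once every segment has an identifier, a parse/serialize round trip no longer changes any of them, which is what makes input and output (or GT and prediction) directly comparable.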

mittagessen commented 5 months ago

It is an oversight on my part, as serialization even throws away the automatically assigned UUIDs in the new container classes. I'll fix it, but it will take a couple of weeks until I get to it.

mittagessen commented 3 months ago

5.0 preserves identifiers on the line and region levels now.

bertsky commented 3 months ago

Ah, there already is a 5.0 release, just not on GitHub.