kba / transkribus-to-prima

Convert Transkribus PAGE-XML to standard PAGE-XML

reading_order fixer: what does it do? #9

Open bertsky opened 2 years ago

bertsky commented 2 years ago

IIUC it searches Relation elements of @type=link, then creates a new OrderedGroupIndexed comprising the relation's RegionRefs, and appends that to the global ReadingOrder/OrderedGroup. It then removes these relations. It then removes all other relations as well.

I don't understand the use-case for that. AFAICS, the deviation between Transkribus and PRImA here is only in that RelationType takes any number of RegionRef in the former, but a single SourceRegionRef and a single TargetRegionRef in the latter.

So why not just do that conversion? What does reading order have to do with it? What if the top-level RO does not exist before, or contains an UnorderedGroup only? Why remove the relation entirely? Why even remove all other relations?
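To make the alternative concrete, here is a minimal sketch of the direct conversion, not the repository's code. It rests on assumptions that nobody in this thread has confirmed: the 2013-07-15 namespace, a `regionRef` attribute on each ref, and above all the mapping itself, which chains consecutive RegionRefs into Source/Target pairs. The helper names `q` and `split_relation` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the input file; adjust to the actual PAGE version.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def q(tag):
    """Return the namespace-qualified tag name (hypothetical helper)."""
    return f"{{{NS}}}{tag}"


def split_relation(relations, relation):
    """Replace one n-ary Relation by pairwise Source/Target relations.

    Assumption: consecutive RegionRefs are chained, so [r1, r2, r3]
    becomes (r1, r2) and (r2, r3). Returns the number of new relations.
    """
    refs = [ref.get("regionRef") for ref in relation.findall(q("RegionRef"))]
    if len(refs) < 2:
        return 0  # nothing to convert, leave the element untouched
    position = list(relations).index(relation)
    relations.remove(relation)
    for offset, (source, target) in enumerate(zip(refs, refs[1:])):
        new = ET.Element(q("Relation"))
        if relation.get("type") is not None:
            new.set("type", relation.get("type"))  # keep the original @type
        ET.SubElement(new, q("SourceRegionRef"), {"regionRef": source})
        ET.SubElement(new, q("TargetRegionRef"), {"regionRef": target})
        relations.insert(position + offset, new)
    return len(refs) - 1
```

This leaves ReadingOrder alone and keeps every relation, which is the behaviour the questions above are asking about.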

bertsky commented 2 years ago

Judging by your https://github.com/kba/transkribus-to-prima/commit/759a6853bdeb6e09994991734689a1e113b221c8, I surmise that Relation is used in lieu of ReadingOrder in Transkribus. Do you know for sure that Relation is not also used for other things (like caption or drop-cap)? And that ReadingOrder is never used (so we can safely overwrite it)?

kba commented 2 years ago

> Judging by your 759a685, I surmise that Relation is used in lieu of ReadingOrder in Transkribus. Do you know for sure that Relation is not also used for other things (like caption or drop-cap)? And that ReadingOrder is never used (so we can safely overwrite it)?

This fix was based on data provided by @socket0 and his colleagues. @tboenig Do you know whether/how Relations are used in Transkribus besides (incorrectly AFAICT) for reading order? Does Transkribus support ReadingOrder as intended? Having little experience with Transkribus myself, it's hard to tell. Searching the TranskribusCore repo does seem to indicate that pc:ReadingOrder is used in the code, but it might just be the XSD-generated code.

bertsky commented 2 years ago

Perhaps we could offer distinct converters for the various reading order conventions that Transkribus seems to use:

bertsky commented 1 year ago

Meanwhile, I did discover other cases of ReadingOrder conventions from Transkribus.

https://zenodo.org/record/257972, https://zenodo.org/record/1297399, https://zenodo.org/record/3945088, https://zenodo.org/record/1322666 and https://zenodo.org/record/1243098 – these all use both ReadingOrder and @custom="readingOrder {index:...;}" (the latter apparently to represent line-level reading order). So that seems to be the normal/common case.

I suggest we first try to find Relation/RegionRef (instead of SourceRegionRef and TargetRegionRef). If that exists, use the existing converter. Otherwise, ignore Relation (assuming it has PRImA semantics already) but turn to ReadingOrder and @custom: If the sequence of @custom indices within a region deviates from the element order, then sort the elements. (The @custom itself can be left unchanged, as it does not violate the schema.)
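A rough sketch of that dispatch and of the sorting step, under stated assumptions: the 2013-07-15 namespace, TextLine as the element carrying @custom, and a `readingOrder {index:N;}` syntax as seen in the datasets above. The function names are hypothetical and none of this is taken from the existing converter.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: namespace of the input file; adjust to the actual PAGE version.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
# Matches e.g. custom="readingOrder {index:3;}" and captures the index.
INDEX_RE = re.compile(r"readingOrder\s*\{[^}]*?index\s*:\s*(\d+)")


def has_transkribus_relations(root):
    """True if any Relation carries RegionRef children (Transkribus style)."""
    return root.find(f".//{{{NS}}}Relation/{{{NS}}}RegionRef") is not None


def custom_index(elem):
    """Return the reading order index from @custom, or None if absent."""
    match = INDEX_RE.search(elem.get("custom", ""))
    return int(match.group(1)) if match else None


def sort_lines_by_custom(region):
    """Reorder the TextLine children of a region by their @custom index.

    Other children (Coords, TextEquiv, ...) keep their positions, and
    @custom itself is left unchanged. Returns True if anything moved.
    """
    children = list(region)
    slots = [i for i, c in enumerate(children) if c.tag == f"{{{NS}}}TextLine"]
    lines = [children[i] for i in slots]
    indices = [custom_index(line) for line in lines]
    if None in indices or indices == sorted(indices):
        return False  # incomplete annotation, or already in order
    ordered = [ln for _, ln in sorted(zip(indices, lines), key=lambda p: p[0])]
    for slot, line in zip(slots, ordered):
        children[slot] = line
    region[:] = children
    return True
```

A caller would check `has_transkribus_relations(root)` first and run the existing converter if it returns True, otherwise call `sort_lines_by_custom` on each TextRegion.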

bertsky commented 1 year ago

Related: #18