Open bertsky opened 2 years ago
Judging by your commit https://github.com/kba/transkribus-to-prima/commit/759a6853bdeb6e09994991734689a1e113b221c8, I surmise that `Relation` is used in lieu of `ReadingOrder` in Transkribus. Do you know for sure that `Relation` is not also used for other things (like caption or drop-cap)? And that `ReadingOrder` is never used (so we can safely overwrite it)?
This fix was based on data provided by @socket0 and his colleagues. @tboenig Do you know whether/how `Relation`s are used in Transkribus besides (incorrectly AFAICT) for reading order? Does Transkribus support `ReadingOrder` as intended? Having little experience with Transkribus myself, I find it hard to tell. Searching the TranskribusCore repo does seem to indicate that `pc:ReadingOrder` is used in the code, but it might just be the XSD-generated code.
Perhaps we could offer distinct converters for the various reading order conventions that Transkribus seems to use:
- `Relation`
- `@custom` (which we have not implemented yet)
- `ReadingOrder`?

Meanwhile, I did discover other cases of `ReadingOrder` conventions from Transkribus.
https://zenodo.org/record/257972, https://zenodo.org/record/1297399, https://zenodo.org/record/3945088, https://zenodo.org/record/1322666 and https://zenodo.org/record/1243098 – these all use both `ReadingOrder` and `@custom="readingOrder {index:...;}"` (the latter apparently to represent line-level reading order). So that seems to be the normal/common case.
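To check which of these conventions a given file actually uses, something like the following could help. This is only a rough sketch: the function name `detect_conventions` and the three labels it returns are made up for illustration, and the regular expression assumes the `readingOrder {index:N;}` shape seen in the datasets above. It needs Python 3.8+ for the `{*}` namespace wildcard.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of the Transkribus custom attribute: readingOrder {index:3;}
RO_CUSTOM = re.compile(r"readingOrder\s*\{\s*index\s*:\s*(\d+)\s*;?\s*\}")


def detect_conventions(page_xml):
    """Return the set of reading order conventions found in a PAGE-XML string."""
    root = ET.fromstring(page_xml)
    found = set()
    # Transkribus-style relations carry plain RegionRef children.
    for rel in root.iterfind(".//{*}Relation"):
        if rel.find("{*}RegionRef") is not None:
            found.add("relation-regionref")
            break
    # Regular PAGE reading order element.
    if root.find(".//{*}ReadingOrder") is not None:
        found.add("reading-order")
    # Index stored in the custom attribute of any element.
    if any(RO_CUSTOM.search(el.get("custom", "")) for el in root.iter()):
        found.add("custom-index")
    return found
```

Running this over a sample of exported files would show how often each convention (or combination) occurs before deciding which converters are worth offering.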
I suggest we first try to find `Relation/RegionRef` (instead of `SourceRegionRef` and `TargetRegionRef`). If that exists, use the existing converter. Otherwise, ignore `Relation` (assuming it has PRImA semantics already) but turn to `ReadingOrder` and `@custom`: If the sequence of `@custom` indices within a region deviates from the element order, then sort the elements. (The `@custom` itself can be left unchanged, as it does not violate the schema.)
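The sorting step could look roughly like this. It is a sketch under assumptions, not the converter's actual code: the helper names are hypothetical, only `TextLine` children are considered, and a region is left alone if any of its lines lacks an index.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of the Transkribus custom attribute: readingOrder {index:3;}
INDEX = re.compile(r"readingOrder\s*\{\s*index\s*:\s*(\d+)")


def custom_index(element):
    """Return the readingOrder index from @custom, or None if absent."""
    match = INDEX.search(element.get("custom", ""))
    return int(match.group(1)) if match else None


def sort_lines_by_custom(region):
    """Reorder the TextLine children of a region in place.

    Returns True if the element order was changed. Other children
    (TextEquiv, Coords, ...) keep their positions; @custom is untouched.
    """
    children = list(region)
    slots = [i for i, child in enumerate(children)
             if child.tag == "TextLine" or child.tag.endswith("}TextLine")]
    lines = [children[i] for i in slots]
    keys = [custom_index(line) for line in lines]
    if None in keys or keys == sorted(keys):
        return False  # nothing to do, or not enough information to sort
    ordered = [line for _, line in
               sorted(zip(keys, lines), key=lambda pair: pair[0])]
    for slot, line in zip(slots, ordered):
        region[slot] = line
    return True
```

Calling `sort_lines_by_custom` on every `TextRegion` of a page would apply the rule region by region; the return value makes it easy to log which regions actually deviated.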
Related: #18
IIUC it searches for `Relation` elements with `@type=link`, then creates a new `OrderedGroupIndexed` comprising the relation's `RegionRef`s, and appends that to the global `ReadingOrder/OrderedGroup`. It then removes these relations. It then removes all relations.

I don't understand the use-case for that. AFAICS, the deviation between Transkribus and PRImA here is only in that `RelationType` takes any number of `RegionRef` in the former, but a single `SourceRegionRef` and a single `TargetRegionRef` in the latter.

So why not just do that conversion? What does reading order have to do with it? What if the top-level RO does not exist beforehand, or contains an `UnorderedGroup` only? Why remove the relation entirely? Why even remove all other relations?
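For illustration, a direct conversion might be sketched as follows. How a relation with more than two `RegionRef`s should map onto source/target pairs is not settled in this thread; chaining consecutive pairs is purely an assumption here, as are the function name `split_relations` and the decision to copy only `@type`. It needs Python 3.8+ for the `{*}` namespace wildcard.

```python
import xml.etree.ElementTree as ET


def split_relations(root):
    """Rewrite Relation elements with N RegionRef children into
    N-1 relations, each with one SourceRegionRef and one TargetRegionRef.

    Assumption: consecutive refs are chained (a,b,c -> a-b, b-c).
    Relations without RegionRef children are left untouched.
    Returns the number of relations created.
    """
    created = 0
    for container in root.iterfind(".//{*}Relations"):
        for rel in list(container):
            refs = rel.findall("{*}RegionRef")
            if len(refs) < 2:
                continue
            # Reuse the namespace of the original element, if any.
            ns = rel.tag[:rel.tag.index("}") + 1] if rel.tag.startswith("{") else ""
            position = list(container).index(rel)
            container.remove(rel)
            for offset, (src, tgt) in enumerate(zip(refs, refs[1:])):
                new = ET.Element(rel.tag)
                if rel.get("type") is not None:
                    new.set("type", rel.get("type"))
                ET.SubElement(new, ns + "SourceRegionRef",
                              regionRef=src.get("regionRef", ""))
                ET.SubElement(new, ns + "TargetRegionRef",
                              regionRef=tgt.get("regionRef", ""))
                container.insert(position + offset, new)
                created += 1
    return created
```

This leaves `ReadingOrder` and all other relations alone, which is the behaviour the questions above are asking for.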