Closed by kba 2 years ago
What if there are use-cases which attach some semantics to the document order?
It's a tradeoff: authenticity vs. usability. Making this behavior configurable gives users the choice, so I'll add a parameter for this.
There is now a kwarg `region_order` / CLI option `--region-order` to determine this behavior, passed to `get_AllRegions`. Default is document-order iteration as before.
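To illustrate the two iteration modes, here is a minimal standalone sketch. It is not the page-to-alto code and does not call `get_AllRegions`; it only uses the standard library on a made-up PAGE snippet. The function name `iter_region_ids`, the region IDs and the value names `"document"` / `"reading-order"` are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# PAGE namespace (2019-07-15 schema version)
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up page: document order is r1, r2, r3; reading order is r3, r2, r1.
PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r3"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
        <RegionRefIndexed index="2" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
    <TextRegion id="r2"/>
    <TextRegion id="r3"/>
  </Page>
</PcGts>"""


def iter_region_ids(page_xml, region_order="document"):
    """Return top-level TextRegion IDs in document or reading order (hypothetical helper)."""
    page = ET.fromstring(page_xml).find("pc:Page", NS)
    doc_ids = [region.get("id") for region in page.findall("pc:TextRegion", NS)]
    if region_order == "document":
        return doc_ids
    if region_order != "reading-order":
        raise ValueError(f"unknown region_order: {region_order!r}")
    # Sort the flat list of region references by their index attribute.
    refs = page.findall("pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    refs.sort(key=lambda ref: int(ref.get("index")))
    ordered = [ref.get("regionRef") for ref in refs if ref.get("regionRef") in doc_ids]
    # Regions missing from the reading order are appended in document order.
    ordered += [rid for rid in doc_ids if rid not in ordered]
    return ordered


document_ids = iter_region_ids(PAGE_XML)
reading_ids = iter_region_ids(PAGE_XML, region_order="reading-order")
```

The sketch only handles a single flat `OrderedGroup` of top-level text regions; nested groups and other region types are left out to keep it short.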
Looking good!
With this change, the regions to be converted are iterated in the order of the `pc:ReadingOrder` instead of document order. This means that in the resulting ALTO, document order represents reading order.

The simple conversion of reading order into `IDNEXT` pointers from one region to the next is insufficient because it still requires one to know the first region, and it's unclear whether ALTO consumers implement this. A better way would be proper ReadingOrder markup for ALTO, but this will take a while to be specified and then implemented.

I think the benefits of being able to read the texts in the right order with the tools we have outweigh the loss of original document order if a `pc:ReadingOrder` is present.

An example to illustrate the issue: in https://digital.staatsbibliothek-berlin.de/werkansicht?PPN=PPN1025202716&PHYSID=PHYS_0010&DMDID=&view=fulltext-parallel the paragraphs detected should start, in order, with
des halben Quadranten
Dieſer Mangel
wird
But due to the ALTO output of page-to-alto, it is rendered as
wird
Dieser Mangel
des halben Quadranten
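To make the `IDNEXT` concern concrete, here is a rough sketch of what a consumer would have to do to recover reading order from `IDNEXT` pointers alone, compared with simply reading blocks in document order. The block IDs and the tuple layout are made up for illustration; the texts are the three paragraph starts from the example above. This is an assumption about how a consumer might work, not a description of any existing ALTO viewer.

```python
def follow_idnext(blocks):
    """Walk a chain of IDNEXT pointers.

    blocks: list of (block_id, idnext_or_None, text) in document order.
    Returns the texts in chain order.
    """
    by_id = {block_id: (idnext, text) for block_id, idnext, text in blocks}
    # The start of the chain is the one block no other block points to,
    # so the consumer has to scan every block before it can begin.
    targets = {idnext for _, idnext, _ in blocks if idnext is not None}
    heads = [block_id for block_id, _, _ in blocks if block_id not in targets]
    if len(heads) != 1:
        raise ValueError("expected exactly one chain start")
    texts, seen, current = [], set(), heads[0]
    while current is not None:
        if current in seen:
            raise ValueError("cycle in IDNEXT chain")
        seen.add(current)
        idnext, text = by_id[current]
        texts.append(text)
        current = idnext
    return texts


# Hypothetical blocks in document order, with IDNEXT encoding the reading order.
blocks = [
    ("b1", None, "wird"),
    ("b2", "b1", "Dieſer Mangel"),
    ("b3", "b2", "des halben Quadranten"),
]

# What a consumer that ignores IDNEXT shows:
naive_order = [text for _, _, text in blocks]
# What a consumer that follows IDNEXT shows:
chained_order = follow_idnext(blocks)
```

A consumer that only reads blocks top to bottom gets `naive_order`, which is the wrong rendering shown above. Writing the blocks in reading order in the first place makes the naive approach give the right result.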