kba / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

convert_text: iterate through regions in reading-order #27

Closed kba closed 2 years ago

kba commented 2 years ago

With this change, the regions to be converted are iterated in order of the pc:ReadingOrder instead of document order. This means that in the resulting ALTO, document order represents reading order.
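To illustrate the idea, here is a minimal sketch, not the actual page-to-alto code. It assumes a flat pc:ReadingOrder containing a single OrderedGroup of RegionRefIndexed entries; the function name `regions_in_reading_order` and the sample document are hypothetical, and nested groups are not handled.

```python
import xml.etree.ElementTree as ET

# PAGE 2019 namespace
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: document order is r1, r2, r3; reading order is r2, r1.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
    <TextRegion id="r2"/>
    <TextRegion id="r3"/>
  </Page>
</PcGts>"""


def regions_in_reading_order(page_xml):
    """Return TextRegion ids in reading order (hypothetical helper).

    Regions not referenced in the ReadingOrder are appended in
    document order, so no region is dropped.
    """
    root = ET.fromstring(page_xml)
    doc_ids = [r.get("id") for r in root.findall(".//pc:TextRegion", NS)]
    refs = root.findall(".//pc:ReadingOrder//pc:RegionRefIndexed", NS)
    # Sort by the numeric index attribute (flat group assumed)
    refs.sort(key=lambda ref: int(ref.get("index")))
    ordered = [ref.get("regionRef") for ref in refs
               if ref.get("regionRef") in doc_ids]
    rest = [rid for rid in doc_ids if rid not in ordered]
    return ordered + rest


print(regions_in_reading_order(SAMPLE_PAGE))
```

Writing the ALTO blocks in the order returned here is what makes document order coincide with reading order in the output.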

Simply converting the reading order into IDNEXT pointers from one region to the next is insufficient: a consumer still has to know which region comes first, and it is unclear whether ALTO consumers implement IDNEXT at all. Proper ReadingOrder markup for ALTO would be better, but it will take a while to be specified and then implemented.
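The extra work an IDNEXT chain puts on a consumer can be sketched as follows. This is a hypothetical consumer, not code from any existing ALTO tool; the function name `follow_idnext` and the sample blocks are made up for illustration. Before it can follow the pointers, the consumer has to find the head of the chain itself, here by looking for the one block that no IDNEXT points to.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical sample: document order is b1, b2, b3; the chain is b2 -> b1 -> b3.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" IDNEXT="b3"/>
    <TextBlock ID="b2" IDNEXT="b1"/>
    <TextBlock ID="b3"/>
  </PrintSpace></Page></Layout>
</alto>"""


def follow_idnext(alto_xml):
    """Reconstruct block order from IDNEXT pointers (hypothetical helper)."""
    root = ET.fromstring(alto_xml)
    blocks = root.findall(".//alto:TextBlock", ALTO_NS)
    next_of = {b.get("ID"): b.get("IDNEXT") for b in blocks}
    # The first block is the one that no other block points to.
    targets = {nxt for nxt in next_of.values() if nxt}
    heads = [bid for bid in next_of if bid not in targets]
    if len(heads) != 1:
        raise ValueError("cannot determine a unique first block")
    order, current = [], heads[0]
    # Stop at the end of the chain, on a dangling pointer, or on a cycle.
    while current in next_of and current not in order:
        order.append(current)
        current = next_of[current]
    return order


print(follow_idnext(SAMPLE_ALTO))
```

A consumer that simply reads the blocks top to bottom never runs this logic, which is why emitting the blocks in reading order is the more robust option.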

I think the benefits of being able to read the texts in the right order with the tools we have outweigh the loss of the original document order if a pc:ReadingOrder is present.

An example to illustrate the issue: in https://digital.staatsbibliothek-berlin.de/werkansicht?PPN=PPN1025202716&PHYSID=PHYS_0010&DMDID=&view=fulltext-parallel the detected paragraphs should start, in order, with

But due to the ALTO output of page-to-alto, it is rendered as

kba commented 2 years ago

What if there are use-cases which attach some semantics to the document order?

It's a tradeoff, authenticity vs. usability. But making this behavior configurable gives users the choice, I'll add a parameter for this.

kba commented 2 years ago

There is now a kwarg region_order / CLI option --region-order to determine this behavior, which is passed on to get_AllRegions. The default is document-order iteration, as before.
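A rough sketch of how such a switch could behave, assuming the two value names "document" and "reading-order" (the accepted values are not stated in this thread, so treat them as assumptions). The function `iterate_regions` is hypothetical and works on plain lists of region ids rather than on the real PAGE API.

```python
def iterate_regions(document_order, reading_order, region_order="document"):
    """Pick the iteration order for regions (hypothetical helper).

    document_order: region ids as they appear in the file
    reading_order:  region ids as listed in pc:ReadingOrder (may be empty)
    region_order:   "document" (default, previous behavior) or
                    "reading-order" (assumed value name)
    """
    if region_order == "document":
        return list(document_order)
    if region_order == "reading-order":
        # Regions missing from the reading order keep their document position
        # relative to each other and are appended at the end.
        known = [rid for rid in reading_order if rid in document_order]
        rest = [rid for rid in document_order if rid not in known]
        return known + rest
    raise ValueError(f"unknown region_order: {region_order!r}")


doc = ["r1", "r2", "r3"]
ro = ["r2", "r1"]
print(iterate_regions(doc, ro))                   # default: document order
print(iterate_regions(doc, ro, "reading-order"))  # opt-in: reading order
```

Keeping document order as the default means existing users see no change unless they opt in.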

bertsky commented 2 years ago

Looking good!