kba / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

TableRegion should become ComposedBlock #1

Open bertsky opened 3 years ago

bertsky commented 3 years ago

https://github.com/kba/page-to-alto/blob/46a8cc2fb74ce327e9d195f1095699cbae946cce/ocrd_page_to_alto/convert.py#L158

I think it's not enough to just map the lower levels here. There might not be any cell segmentation yet, only a detected table. And even if there is structure below that level, it's worthwhile mapping the recursive structure 1:1.

For that, there's the equivalent ComposedBlock in ALTO.

kba commented 3 years ago

It is a ComposedBlock:

https://github.com/kba/page-to-alto/blob/46a8cc2fb74ce327e9d195f1095699cbae946cce/ocrd_page_to_alto/convert.py#L25

Since you're working with invoices and such, can you please share some samples for tables in PAGE-XML, then I can improve and test the table conversion.

bertsky commented 3 years ago

> It is a ComposedBlock:

Sorry, I was reading too sloppily.

> Since you're working with invoices and such, can you please share some samples for tables in PAGE-XML, then I can improve and test the table conversion.

Sure. How about assets/data/gutachten/data?

kba commented 3 years ago

For the sample gutachten/data/TEMP1/PAGE_TEMP1.xml, the current behavior seems to be correct:

<TableRegion>
  <TextRegion>
    <TextLine>

in PAGE becomes in ALTO:

<ComposedBlock>
  <TextBlock>
    <TextLine>

I couldn't find a sample for a more complex table with deeper recursion than 1.
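To look for such samples, a small script along these lines could report how deeply regions are nested inside any TableRegion of a PAGE file. This is only a sketch: `local_name`, `region_depth` and `max_table_depth` are hypothetical helpers, not part of page-to-alto, and "region" is assumed to mean any element whose local name ends in `Region`.

```python
import xml.etree.ElementTree as ET


def local_name(elem):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return elem.tag.rsplit("}", 1)[-1]


def region_depth(region):
    """Number of region levels nested below `region` (0 if there are none)."""
    depths = [region_depth(child) for child in region
              if local_name(child).endswith("Region")]
    return 1 + max(depths) if depths else 0


def max_table_depth(root):
    """Deepest region nesting found below any TableRegion under `root`."""
    tables = [e for e in root.iter() if local_name(e) == "TableRegion"]
    return max((region_depth(t) for t in tables), default=0)


if __name__ == "__main__":
    import sys
    # Usage: python table_depth.py PAGE_TEMP1.xml [more PAGE files ...]
    for path in sys.argv[1:]:
        print(path, max_table_depth(ET.parse(path).getroot()))
```

With this definition, the TableRegion / TextRegion / TextLine structure shown above has depth 1; anything reporting 2 or more would be a candidate for testing deeper recursion.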

kba commented 3 years ago

f138114 should support arbitrarily deep nesting in tables if I got the recursion right.

bertsky commented 3 years ago

> f138114 should support arbitrarily deep nesting in tables if I got the recursion right.

Yes, I think you did. But there are more cases: in PAGE, TextRegion can itself contain both nested TextRegions and immediate TextLines. And all region types are recursive, not just tables.

The problem is that in ALTO, TextBlock is not recursive, only ComposedBlock is. And ComposedBlock is not allowed to have TextLines directly.

So you could (/probably need to) generalize the current pattern. But we would need to split up PAGE's "typed recursion" into ALTO's "pure recursion".

For example, if you have a GraphicRegion with embedded TextRegions, that would need to become a ComposedBlock comprised of an equally located/sized Illustration (which also maps its @type) followed by a list of TextBlocks for each embedded region.

Or if you have a TextRegion with immediate TextLines as well as embedded TextRegions, that would need to become a ComposedBlock comprised of an equally located TextBlock (with all the TextLines and its @type and @primaryLanguage), followed by a list of TextBlocks for the embedded regions.
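The two examples above could be sketched roughly as follows. This is not the code in convert.py, just an illustration of the splitting idea under simplifying assumptions: element tags carry no XML namespace, only IDs are copied (no coordinates, @type or @primaryLanguage), and the `_self` ID suffix, `LEAF_BLOCK`, `add_own_block` and `convert_region` are made-up names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately incomplete mapping of PAGE regions that have
# a non-recursive ALTO counterpart.
LEAF_BLOCK = {"TextRegion": "TextBlock", "GraphicRegion": "Illustration"}


def add_own_block(region, parent, block_id):
    """Emit the region's own (non-recursive) ALTO block under `parent`."""
    tag = LEAF_BLOCK[region.tag]
    block = ET.SubElement(parent, tag, ID=block_id)
    if tag == "TextBlock":
        # Only immediate TextLines belong to this block.
        for line in region.findall("TextLine"):
            ET.SubElement(block, "TextLine", ID=line.get("id", ""))
    return block


def convert_region(region, parent):
    """Append the ALTO equivalent of a PAGE region to `parent`."""
    region_id = region.get("id", "")
    subregions = [c for c in region if c.tag.endswith("Region")]

    # No embedded regions: a plain TextBlock / Illustration is enough.
    if not subregions and region.tag in LEAF_BLOCK:
        return add_own_block(region, parent, region_id)

    # Otherwise wrap everything in a ComposedBlock ("pure recursion").
    composed = ET.SubElement(parent, "ComposedBlock", ID=region_id)
    has_own_content = (region.tag == "GraphicRegion"
                       or region.find("TextLine") is not None)
    if region.tag in LEAF_BLOCK and has_own_content:
        # The region's own content becomes the first child; it needs an
        # ID distinct from the ComposedBlock (suffix is an assumption).
        add_own_block(region, composed, region_id + "_self")
    for sub in subregions:
        convert_region(sub, composed)
    return composed


if __name__ == "__main__":
    page_region = ET.fromstring(
        '<TextRegion id="r1"><TextLine id="l1"/>'
        '<TextRegion id="r2"><TextLine id="l2"/></TextRegion></TextRegion>')
    print_space = ET.Element("PrintSpace")
    convert_region(page_region, print_space)
    print(ET.tostring(print_space, encoding="unicode"))
```

A TableRegion (or any other type without an entry in `LEAF_BLOCK`) simply becomes a ComposedBlock here, even when it has no embedded regions yet.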

It's unclear, though, what to do with the TextEquiv at the region level (esp. if there's no line level below it) and other PAGE-specific info under TextRegion (like @leading / @align / @indented or @primaryScript or the order/direction attributes).

kba commented 3 years ago

I'll try to implement basic and mixed-lines/regions recursion with ComposedBlock.

> It's unclear, though, what to do with the TextEquiv at the region level

There is nothing we can do I think. ALTO only allows content for String.

@leading could be mapped to @LINESPACE, @align is implemented via ParagraphStyle. @indented could be mapped to either @LEFT or @FIRSTLINE?
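As a rough sketch of that attribute mapping (not what convert.py does): `ALIGN_TO_ALTO` and `paragraph_style_attrs` are hypothetical names, the enumeration values are written from memory of the PAGE and ALTO schemas and should be checked against the XSDs, and @leading is copied without any unit conversion. @indented is left out on purpose, since the target attribute is still an open question.

```python
# PAGE @align value -> ALTO ParagraphStyle @ALIGN value
# (assumed enumerations; verify against the schemas).
ALIGN_TO_ALTO = {
    "left": "Left",
    "centre": "Center",
    "right": "Right",
    "justify": "Block",
}


def paragraph_style_attrs(region_attrs):
    """Build ParagraphStyle attributes from a TextRegion's attribute dict."""
    attrs = {}
    align = region_attrs.get("align")
    if align in ALIGN_TO_ALTO:
        attrs["ALIGN"] = ALIGN_TO_ALTO[align]
    if "leading" in region_attrs:
        # Copied verbatim; PAGE and ALTO may measure this in different units.
        attrs["LINESPACE"] = region_attrs["leading"]
    # @indented is intentionally not mapped (LEFT vs. FIRSTLINE undecided).
    return attrs
```

For instance, `paragraph_style_attrs({"align": "centre", "leading": "12"})` would yield the attributes for a centered paragraph style with LINESPACE 12.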

kba commented 5 months ago

The behavior is buggy: it duplicates TextRegions within TableRegions in PAGE, converting each to both a ComposedBlock and a TextBlock on the same level.