Open bertsky opened 3 years ago
It is a ComposedBlock:
Since you're working with invoices and such, can you please share some samples for tables in PAGE-XML, then I can improve and test the table conversion.
It is a ComposedBlock:
Sorry, I was reading too sloppily.
Since you're working with invoices and such, can you please share some samples for tables in PAGE-XML, then I can improve and test the table conversion.
Sure. How about assets/data/gutachten/data?
For the sample gutachten/data/TEMP1/PAGE_TEMP1.xml, the current behavior seems to be correct:
<TableRegion>
  <TextRegion>
    <TextLine>
in PAGE becomes in ALTO:
<ComposedBlock>
  <TextBlock>
    <TextLine>
I couldn't find a sample for a more complex table with deeper recursion than 1.
f138114 should support arbitrarily deep nesting in tables if I got the recursion right.
Yes, I think you did. But there are more cases: in PAGE, TextRegion can itself contain both nested TextRegions and immediate TextLines. And all region types are recursive, not just tables.
The problem is that in ALTO, TextBlock is not recursive, only ComposedBlock is. And ComposedBlock is not allowed to have TextLines directly.
So you could (/probably need to) generalize the current pattern. But we would need to split up PAGE's "typed recursion" into ALTO's "pure recursion".
For example, if you have a GraphicRegion with embedded TextRegions, that would need to become a ComposedBlock comprised of an equally located/sized Illustration (which also maps its @type) followed by a list of TextBlocks, one for each embedded region.
Or if you have a TextRegion with immediate TextLines as well as embedded TextRegions, that would need to become a ComposedBlock comprised of an equally located TextBlock (with all the TextLines and its @type and @primaryLanguage), followed by a list of TextBlocks for the embedded regions.
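To make the two cases above concrete, here is a rough sketch of the splitting rule, not the actual converter code. It assumes bare, namespace-free element names and ignores coordinates, IDs and attributes entirely; `convert_region`, `REGION_TAGS` and `OWN_BLOCK` are hypothetical names made up for the illustration. Nested regions are converted recursively, so a deeper embedded region becomes a ComposedBlock itself rather than a plain TextBlock.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified model: PAGE region types that may nest, and the
# ALTO block that carries a region's "own" content (tables have none here).
REGION_TAGS = {"TextRegion", "GraphicRegion", "TableRegion"}
OWN_BLOCK = {"TextRegion": "TextBlock", "GraphicRegion": "Illustration"}


def convert_region(region):
    """Map one PAGE region element (recursively) to an ALTO block element."""
    nested = [c for c in region if c.tag in REGION_TAGS]
    lines = [c for c in region if c.tag == "TextLine"]
    own_tag = OWN_BLOCK.get(region.tag)

    # Block holding the region's own content (its lines, or the picture).
    own = None
    if own_tag is not None:
        own = ET.Element(own_tag)
        for _ in lines:
            ET.SubElement(own, "TextLine")

    # No embedded regions: the plain 1:1 mapping is enough.
    if own is not None and not nested:
        return own

    # Otherwise wrap everything in a ComposedBlock: first the region's own
    # block (skipped for a TextRegion without lines), then the embedded ones.
    composed = ET.Element("ComposedBlock")
    if own is not None and (lines or own_tag == "Illustration"):
        composed.append(own)
    for child in nested:
        composed.append(convert_region(child))
    return composed
```

With this, a TableRegion without any cell segmentation still yields an (empty) ComposedBlock, and a TextRegion that has both lines and embedded regions yields a ComposedBlock whose first child is the TextBlock with the region's own lines.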
It's unclear, though, what to do with the TextEquiv at the region level (esp. if there's no line level below it) and with other PAGE-specific info under TextRegion (like @leading / @align / @indented or @primaryScript or the order/direction attributes).
I'll try to implement basic and mixed-lines/regions recursion with ComposedBlock.
It's unclear, though, what to do with the TextEquiv at the region level
There is nothing we can do, I think. ALTO only allows content for String.
@leading could be mapped to @LINESPACE, @align is implemented via ParagraphStyle. @indented could be mapped to either @LEFT or @FIRSTLINE?
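A minimal sketch of what such an attribute mapping might look like, purely as an illustration. The helper `paragraph_style_attrs` and the `ALIGN_MAP` table are hypothetical, the value correspondence for alignment is my assumption, and @leading is copied over without any unit conversion. @indented is left unmapped, since as far as I can tell it is only a flag in PAGE and carries no width that could go into @LEFT or @FIRSTLINE.

```python
# Assumed correspondence between PAGE @align values and ALTO ParagraphStyle
# @ALIGN values; check against both schemas before relying on it.
ALIGN_MAP = {
    "left": "Left",
    "right": "Right",
    "centre": "Center",
    "justify": "Block",
}


def paragraph_style_attrs(region_attrs):
    """Build ParagraphStyle attributes from a dict of PAGE TextRegion attributes."""
    style = {}
    align = region_attrs.get("align")
    if align in ALIGN_MAP:
        style["ALIGN"] = ALIGN_MAP[align]
    if "leading" in region_attrs:
        # Assumption: the value is usable as-is; no unit conversion is done.
        style["LINESPACE"] = str(region_attrs["leading"])
    # @indented is deliberately not mapped: a flag alone gives no value
    # for @LEFT or @FIRSTLINE.
    return style
```

The returned dict would then be set on a ParagraphStyle element that the TextBlock references.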
The behavior is buggy: it duplicates TextRegions within TableRegions in PAGE into a ComposedBlock and a TextBlock on the same level.
https://github.com/kba/page-to-alto/blob/46a8cc2fb74ce327e9d195f1095699cbae946cce/ocrd_page_to_alto/convert.py#L158
I think it's not enough to just map the lower levels here. There might not be any cell segmentation yet, only a detected table. And even if there is structure below that level, it's worthwhile mapping the recursive structure 1:1.
For that, there's the equivalent ComposedBlock in ALTO.