kba / transkribus-to-prima

Convert Transkribus PAGE-XML to standard PAGE-XML

negative region coordinates and empty regions #19

Open jahtz opened 8 months ago

jahtz commented 8 months ago

After running the script, I noticed that the output contains some negative coordinates, some Coords elements with too few points, and some empty regions.

negative coordinates:

<TextLine id="r1l31" custom="readingOrder {index:28;}">
    <Coords points="304,4432 2797,4482 2799,4365 1058,4337 -1323,4351"/>
    <Baseline points="320,4410 443,4412 566,4414 689,4416 812,4418 935,4421 1058,4422 1181,4424 1304,4426 1427,4428 1550,4430 1673,4434 1796,4436 1919,4438 2042,4442 2165,4446 2288,4450 2411,4454 2534,4460 2657,4464 2780,4470"/>
    <TextEquiv>
        ...

-> Value '304,4432 2797,4482 2799,4365 1058,4337 -1323,4351' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.
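The rejection above can be reproduced outside the validator. Below is a minimal Python sketch that applies the pattern quoted in the error message to a points string. The names `is_valid_points` and `clamp_points` are hypothetical helpers for illustration, not part of the conversion script, and clamping negative values to zero is only an assumed workaround, not a confirmed fix.

```python
import re

# Pattern copied from the validator message for type 'PointsType'.
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")


def is_valid_points(points):
    """Return True if the whole string matches the PointsType pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None


def clamp_points(points):
    """Assumed workaround: replace negative x or y values with 0."""
    clamped = []
    for pair in points.split():
        x, y = (int(value) for value in pair.split(","))
        clamped.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(clamped)


if __name__ == "__main__":
    bad = "304,4432 2797,4482 2799,4365 1058,4337 -1323,4351"
    print(is_valid_points(bad))
    print(clamp_points(bad))
    print(is_valid_points(clamp_points(bad)))
```

Clamping changes the polygon shape, so it may hide the underlying problem rather than solve it. Whether the negative value comes from the Transkribus export or from the conversion is not clear from this output alone.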

not enough coordinates and empty regions:

<TextRegion id="region_1535370511662_1" custom="readingOrder {index:1;}">
    <Coords points="206,1554"/>
    <TextEquiv>
        <Unicode/>
    </TextEquiv>
</TextRegion>

-> Value '206,1554' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.
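To locate every affected element in a file, something like the following Python sketch could be used. It walks the document and reports each parent whose Coords child has a points attribute that fails the quoted pattern, which covers both negative values and single-point regions. The function name `find_invalid_coords` is hypothetical. Element names are matched without their namespace, on the assumption that the PAGE namespace URI may differ between schema versions.

```python
import re
import xml.etree.ElementTree as ET

# Pattern copied from the validator message for type 'PointsType'.
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def find_invalid_coords(xml_text):
    """Return (parent id, points) for every Coords that fails the pattern."""
    root = ET.fromstring(xml_text)
    problems = []
    for parent in root.iter():
        for child in parent:
            if local_name(child.tag) != "Coords":
                continue
            points = child.get("points", "")
            if POINTS_PATTERN.fullmatch(points) is None:
                problems.append((parent.get("id"), points))
    return problems


if __name__ == "__main__":
    sample = """
    <Page>
      <TextRegion id="region_1535370511662_1">
        <Coords points="206,1554"/>
      </TextRegion>
      <TextRegion id="r2">
        <Coords points="0,0 10,0 10,10 0,10"/>
      </TextRegion>
    </Page>
    """
    for element_id, points in find_invalid_coords(sample):
        print(element_id, repr(points))
```

The sample document in the `__main__` block is made up for illustration and only reuses the region id and points value shown above.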

ty!

stweil commented 8 months ago

Could you please append an example PAGE file which can be used to reproduce the issue?

jahtz commented 8 months ago

Sorry for the delay. The files are attached below:

_negativel218.xml: negative coordinates at line 821
_empty_linel821.xml: empty region at line 218

Thank you very much!

xmls.zip