mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

`serialize_segmentation` handling of text line during Alto XML serialization #390

Closed by colibrisson 2 years ago

colibrisson commented 2 years ago

The function serialization.serialize_segmentation is giving me strange output. In the tag declarations at the top of the ALTO XML document, I get duplicated OtherTag elements whose labels look like stringified Python objects (e.g. LABEL="dict_values(['default'])"). Moreover, the TAGREFS attribute is missing from all the TextLine elements, even the non-default ones.

<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/standards/alto/ns-v4#" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-2.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>img/四库全书第0007册/四库全书第0007册0003-023.tif</fileName>
</sourceImageInformation>
<OCRProcessing ID="OCR_0">
<ocrProcessingStep>
<processingSoftware>
<softwareName>kraken</softwareName>
</processingSoftware>
</ocrProcessingStep>
</OCRProcessing>
</Description>
<Tags>
<OtherTag ID="TYPE_1" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_2" LABEL="dict_values(['DoubleLine'])"/>
<OtherTag ID="TYPE_3" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_4" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_5" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_6" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_7" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_8" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_9" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_10" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_11" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_12" LABEL="dict_values(['DoubleLine'])"/>
<OtherTag ID="TYPE_13" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_14" LABEL="dict_values(['DoubleLine'])"/>
<OtherTag ID="TYPE_15" LABEL="dict_values(['DoubleLine'])"/>
<OtherTag ID="TYPE_16" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_17" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_18" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_19" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_20" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_21" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_22" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_23" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_24" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_25" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_26" LABEL="dict_values(['DoubleLine'])"/>
<OtherTag ID="TYPE_27" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_28" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_29" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_30" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_31" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_32" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_33" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_34" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_35" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_36" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_37" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_38" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_39" LABEL="dict_values(['default'])"/>
<OtherTag ID="TYPE_40" LABEL="Margin"/>
<OtherTag ID="TYPE_41" LABEL="IllustrationLeft"/>
<OtherTag ID="TYPE_42" LABEL="Title"/>
<OtherTag ID="TYPE_43" LABEL="Author"/>
<OtherTag ID="TYPE_44" LABEL="Main"/>
<OtherTag ID="TYPE_45" LABEL="MainLeft"/>
<OtherTag ID="TYPE_46" LABEL="Illustration"/>
</Tags>
<Layout>
<Page WIDTH="4296" HEIGHT="6065" PHYSICAL_IMG_NR="0" ID="page_0">
<PrintSpace HPOS="0" VPOS="0" WIDTH="4296" HEIGHT="6065">
<TextBlock ID="block_3" HPOS="2090" VPOS="377" WIDTH="1693" HEIGHT="2571" TAGREFS="TYPE_44">
<Shape>
<Polygon POINTS="3756 2948 2090 2934 2104 377 3783 404 3756 2948"/>
</Shape>
<TextLine ID="line_0" HPOS="3547" VPOS="471" WIDTH="179" HEIGHT="2457" BASELINE="3628 475 3638 2928">
<Shape>
<Polygon POINTS="3628 475 3574 475 3574 822 3571 825 3547 869 3547 872 3547 876 3571 993 3577 1020 3577 1132 3571 1148 3547 1216 3547 1219 3547 1223 3547 1226 3571 1266 3571 1270 3571 1277 3550 1341 3550 1344 3550 1347 3550 1351 3567 1401 3577 1432 3567 1482 3560 1526 3560 1529 3567 1576 3574 1617 3567 1634 3550 1674 3550 1677 3550 1681 3567 1806 3571 1843 3567 1856 3550 1893 3550 1896 3550 1900 3554 2142 3554 2146 3554 2149 3564 2166 3574 2183 3564 2254 3554 2345 3554 2348 3564 2928 3638 2928 3726 2911 3705 2880 3726 2837 3726 2830 3726 2826 3726 2823 3726 2820 3726 2816 3722 2816 3699 2789 3719 2607 3719 2604 3719 2601 3719 2597 3699 2564 3719 2516 3726 2500 3726 2496 3726 2493 3726 2490 3719 2473 3699 2436 3716 2388 3722 2372 3722 2368 3722 2365 3722 2361 3722 2358 3719 2358 3716 2351 3699 2331 3699 2206 3712 2180 3726 2153 3726 2149 3726 2146 3726 2142 3726 2139 3712 2122 3699 2105 3712 2078 3722 2055 3722 2051 3722 2048 3722 2045 3722 2041 3712 2028 3695 2008 3709 1981 3722 1957 3722 1954 3722 1950 3722 1947 3709 1923 3695 1896 3709 1846 3719 1806 3719 1802 3719 1799 3719 1795 3719 1792 3709 1775 3695 1758 3705 1725 3722 1674 3722 1671 3722 1667 3702 1489 3695 1415 3702 1398 3722 1351 3722 1347 3722 1344 3722 1341 3699 1293 3719 1270 3719 1266 3719 1263 3719 1260 3699 1135 3719 1074 3719 1071 3719 1068 3719 1064 3719 1061 3716 1061 3695 1037 3692 1037 3695 1034 3719 987 3719 983 3719 980 3719 977 3695 916 3719 852 3719 849 3719 845 3719 842 3692 741 3719 704 3719 700 3719 697 3719 694 3719 690 3719 687 3716 687 3692 660 3719 593 3719 589 3719 586 3719 582 3685 471 3628 475"/>
</Shape>
<String CONTENT=""/>
</TextLine>
<TextLine ID="line_1" HPOS="3338" VPOS="468" WIDTH="175" HEIGHT="2453" BASELINE="3412 471 3426 2921">
<Shape>
<Polygon POINTS="3412 471 3372 471 3372 542 3338 603 3338 606 3338 609 3338 613 3338 616 3341 616 3368 650 3368 704 3341 731 3338 731 3338 734 3338 737 3338 741 3338 744 3338 748 3361 781 3338 886 3338 889 3361 1024 3348 1209 3348 1213 3348 1216 3348 1219 3372 1246 3345 1280 3345 1283 3345 1287 3345 1290 3345 1293 3368 1364 3341 1438 3341 1442 3341 1445 3341 1448 3341 1452 3375 1496 3375 1583 3345 1613 3341 1613 3341 1617 3341 1620 3341 1624 3341 1627 3372 1725 3348 1752 3345 1752 3345 1755 3345 1758 3345 1762 3345 1765 3368 1846 3345 1927 3345 1930 3345 1934 3345 1937 3345 1940 3348 1940 3375 1967 3345 2028 3345 2031 3345 2035 3345 2038 3345 2041 3365 2075 3348 2095 3345 2095 3345 2099 3345 2102 3345 2105 3345 2180 3345 2183 3345 2186 3348 2186 3365 2206 3345 2260 3345 2264 3345 2267 3345 2271 3345 2274 3348 2274 3378 2311 3348 2490 3348 2493 3348 2496 3375 2560 3348 2611 3348 2614 3348 2618 3348 2621 3365 2688 3348 2752 3348 2756 3348 2759 3385 2921 3426 2921 3513 2907 3486 2823 3513 2779 3513 2776 3513 2773 3513 2769 3513 2766 3513 2759 3486 2695 3510 2655 3513 2651 3513 2648 3513 2645 3513 2641 3513 2638 3513 2634 3510 2631 3486 2597 3486 2537 3506 2516 3510 2513 3513 2513 3513 2510 3513 2506 3513 2503 3513 2500 3506 2479 3486 2422 3486 2321 3503 2284 3510 2267 3510 2264 3510 2260 3510 2257 3510 2254 3506 2254 3503 2247 3483 2227 3503 2193 3510 2176 3510 2173 3510 2169 3510 2166 3510 2163 3500 2149 3483 2122 3483 2082 3500 2055 3510 2041 3510 2038 3510 2035 3510 2031 3510 2028 3510 2025 3506 2025 3500 2014 3483 1998 3483 1954 3496 1920 3510 1890 3510 1886 3510 1883 3510 1880 3510 1876 3506 1876 3496 1863 3483 1853 3496 1822 3506 1792 3506 1789 3506 1785 3506 1782 3506 1779 3503 1779 3493 1768 3479 1755 3479 1667 3479 1600 3490 1587 3503 1573 3506 1573 3506 1570 3506 1566 3506 1563 3506 1560 3506 1556 3490 1536 3479 1519 3490 1448 3506 1310 3506 1307 3506 1303 3486 1250 3479 1236 3486 1233 3503 1209 3506 1209 3506 1206 3506 1202 3510 1202 3510 1105 3506 1101 
3506 1098 3506 1095 3503 1095 3483 1074 3479 1071 3483 1068 3500 1044 3500 1041 3500 1037 3503 859 3503 855 3503 852 3479 785 3503 734 3503 731 3503 727 3503 724 3476 657 3503 576 3503 572 3503 569 3503 566 3503 562 3500 562 3473 532 3469 468 3412 471"/>
</Shape>
<String CONTENT=""/>
</TextLine>
<TextLine ID="line_2" HPOS="3136" VPOS="481" WIDTH="168" HEIGHT="2433" BASELINE="3206 485 3210 2914">
<Shape>
<Polygon POINTS="3206 485 3156 485 3156 670 3156 673 3136 694 3136 697 3136 700 3136 704 3136 707 3156 754 3159 761 3159 1024 3156 1031 3136 1088 3136 1091 3136 1095 3136 1098 3156 1138 3156 1142 3136 1196 3136 1199 3136 1202 3156 1357 3156 1364 3156 1367 3136 1394 3136 1398 3136 1401 3136 1405 3136 1408 3136 1411 3152 1428 3139 1445 3136 1445 3136 1448 3136 1452 3136 1455 3136 1458 3156 1499 3136 1563 3136 1566 3136 1570 3136 1684 3136 1688 3136 1691 3139 1691 3152 1704 3136 1738 3136 1742 3136 1745 3136 1748 3136 1752 3139 1752 3156 1772 3136 1846 3136 1849 3136 1853 3136 1940 3136 1944 3136 1947 3139 1947 3152 1964 3136 1994 3136 1998 3136 2001 3136 2004 3136 2008 3156 2062 3139 2089 3139 2092 3139 2095 3139 2099 3152 2456 3136 2580 3136 2584 3136 2587 3136 2591 3139 2591 3149 2604 3139 2614 3136 2614 3136 2618 3136 2621 3136 2624 3136 2759 3136 2762 3136 2766 3156 2786 3136 2860 3136 2864 3136 2867 3136 2870 3159 2914 3210 2914 3284 2911 3301 2860 3301 2857 3301 2853 3301 2850 3301 2847 3284 2830 3270 2810 3270 2675 3284 2661 3301 2641 3301 2638 3301 2634 3301 2631 3301 2628 3301 2624 3297 2624 3284 2607 3270 2594 3270 2557 3281 2533 3301 2506 3301 2503 3301 2500 3301 2496 3301 2493 3301 2490 3297 2490 3281 2473 3270 2459 3270 2422 3281 2399 3301 2361 3301 2358 3301 2355 3301 2351 3301 2348 3281 2324 3274 2318 3281 2311 3297 2291 3301 2291 3301 2287 3301 2284 3301 2281 3301 2277 3301 2274 3281 2240 3277 2237 3281 2210 3301 2028 3301 2025 3301 2021 3277 1977 3270 1964 3277 1947 3301 1890 3301 1886 3301 1883 3301 1880 3301 1876 3277 1836 3270 1822 3270 1701 3274 1677 3301 1583 3301 1580 3301 1576 3301 1573 3301 1570 3297 1570 3274 1543 3270 1539 3274 1533 3297 1469 3297 1465 3297 1462 3297 1458 3297 1455 3294 1455 3274 1428 3270 1425 3270 1378 3274 1371 3301 1324 3301 1320 3301 1317 3301 1314 3301 1310 3301 1307 3281 1283 3301 1216 3301 1213 3301 1209 3301 1206 3270 1145 3301 1111 3301 1108 3301 1105 3301 1101 3301 1098 3274 1041 3301 950 3301 946 3301 943 3301 
940 3301 936 3297 936 3274 906 3301 865 3301 862 3301 859 3304 859 3304 734 3301 731 3301 727 3274 657 3301 603 3301 599 3301 596 3301 593 3301 589 3297 589 3267 555 3264 481 3206 485"/>
</Shape>
<String CONTENT=""/>
</TextLine>
</TextBlock>

This output was produced using kraken 4.1.2 with the following code:

from kraken import blla, serialization

baseline_seg = blla.segment(im, model=model, text_direction='vertical-rl')
alto = serialization.serialize_segmentation(baseline_seg, image_name=im_path, image_size=im.size, template='alto')

I tried different text_direction values but the issue remains. Using the same segmentation model, kraken 3 and eScriptorium give me a well-formatted Alto XML document.
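For anyone wanting to check whether their own output shows the two symptoms described above, here is a minimal sketch using only the Python standard library. The helper name `check_alto_tags` is hypothetical and not part of kraken; it assumes the ALTO v4 namespace shown in the excerpt and takes the serialized document as a string.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the <alto> root element in the excerpt above.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}


def check_alto_tags(alto_xml: str) -> dict:
    """Report duplicated OtherTag labels and TextLine elements without TAGREFS.

    Hypothetical diagnostic helper, not a kraken function.
    """
    root = ET.fromstring(alto_xml)

    # Count how often each LABEL occurs among the declared OtherTag elements.
    labels = Counter(
        tag.get("LABEL") for tag in root.iterfind(".//a:Tags/a:OtherTag", ALTO_NS)
    )
    duplicated = {label: n for label, n in labels.items() if n > 1}

    # Collect the IDs of all TextLine elements that carry no TAGREFS attribute.
    untagged = [
        line.get("ID")
        for line in root.iterfind(".//a:TextLine", ALTO_NS)
        if line.get("TAGREFS") is None
    ]
    return {"duplicated_labels": duplicated, "untagged_lines": untagged}
```

Calling `check_alto_tags(alto)` on the string returned by `serialize_segmentation` should list every repeated label and every line ID lacking a tag reference; both collections being empty would suggest the output is free of these two problems.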

mittagessen commented 2 years ago

That's already fixed in 4.2. I had added multi-tagging support to the pipeline, but the tests didn't catch those errors in the output serialization because the output is technically still valid ALTO.
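The labels in the report match what Python prints when a dictionary's values view is converted to a string. The sketch below reproduces that symptom in isolation and contrasts it with collecting each distinct tag value once. It is only an illustration: the per-line `tags` dictionary and both function names are assumptions made for the example and do not reflect kraken's actual internals or the fix shipped in 4.2.

```python
def stringified_labels(lines):
    """Mimic the symptom: one label per line, built from str() of a dict view."""
    # str({'type': 'default'}.values()) yields "dict_values(['default'])".
    return [str(line["tags"].values()) for line in lines]


def unique_tag_ids(lines):
    """Assign one ID per distinct tag value, in first-seen order."""
    ids = {}
    for line in lines:
        for value in line["tags"].values():
            if value not in ids:
                ids[value] = f"TYPE_{len(ids) + 1}"
    return ids


# Hypothetical segmentation lines, each carrying a small tag dictionary.
lines = [
    {"tags": {"type": "default"}},
    {"tags": {"type": "DoubleLine"}},
    {"tags": {"type": "default"}},
]
```

With the sample data, `stringified_labels(lines)` produces three entries including a repeated `dict_values(['default'])`, while `unique_tag_ids(lines)` maps `default` and `DoubleLine` to one ID each, which is the shape a TextLine's TAGREFS attribute could then point to.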