mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

recognizer: Shapely problem, unspecific error message #565

Closed bertsky closed 6 months ago

bertsky commented 8 months ago

When using the recognizer via CLI on PAGE files, I sometimes get failures like this:

Writing recognition results for GN 1771,4 (GN.A.158)/552_202cd_default.xml
[01/24/24 12:28:34] ERROR    Failed processing GN 1771,4 (GN.A.158)/552_202cd_default.xml: A LinearRing must have at least 3 coordinate tuples    kraken.py:418

Unfortunately, the exception is caught way too high up, so the file name alone is pretty worthless in this case. (And I did not find any obviously broken coordinates, except for Escriptorium's idiotic dummyblock region-level coords Coords points="0,0 0,0", but does that matter here?)

Without a proper stacktrace I have trouble finding where the error originates. But I suspect if we modify parse_page._parse_coords to try to instantiate a Polygon or LinearRing, then we would see the earlier exception with the @id in the message...
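
To illustrate the idea, here is a minimal sketch of an early check that reports the element id. It is not kraken's actual `_parse_coords`; the function name `parse_coords_checked`, its signature, and the id `region_1` are hypothetical, and it only counts points instead of instantiating a Shapely geometry.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def parse_coords_checked(points: str, elem_id: str) -> List[Point]:
    """Parse a PAGE-style points string and fail early on too few points.

    Hypothetical helper for illustration, not part of kraken.
    Assumes integer coordinates in the form "x,y x,y ...".
    """
    coords: List[Point] = []
    for pair in points.split():
        x, y = pair.split(',')
        coords.append((int(x), int(y)))
    if len(coords) < 3:
        # Name the offending element so the failure can be located in the file.
        raise ValueError(
            f'Element {elem_id!r}: polygon needs at least 3 points, '
            f'got {len(coords)} from points="{points}"'
        )
    return coords


# The dummy region coordinates quoted above trip the check ('region_1' is a made-up id).
try:
    parse_coords_checked("0,0 0,0", "region_1")
except ValueError as err:
    print(err)
```

With a check like this at parse time, the error message would carry the @id instead of only the file name.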

mittagessen commented 8 months ago

Can you run it with --raise-on-error? That should give you a proper stack trace and me something to work with.

> but does that matter here?

I doubt it.

bertsky commented 8 months ago

> Can you run it with --raise-on-error? That should give you a proper stack trace and me something to work with.

Oh, wow! That looks really pretty:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in shapely.speedups._speedups.geos_linearring_from_py:252                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'list' object has no attribute '__array_interface__'

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /local/ocr-d/ocrd_all/venv/bin/kraken:8 in <module>                                              │
│                                                                                                  │
│   5 from kraken.kraken import cli                                                                │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py:1157 in __call__            │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py:1078 in main                │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py:1720 in invoke              │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py:1657 in _process_result     │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/click/core.py:783 in invoke               │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/kraken/kraken.py:416 in process_pipeline  │
│                                                                                                  │
│   413 │   │   │   for idx, (task, input, output) in enumerate(zip(subcommands, fc, fc[1:])):     │
│   414 │   │   │   │   if len(fc) - 2 == idx:                                                     │
│   415 │   │   │   │   │   ctx.meta['last_process'] = True                                        │
│ ❱ 416 │   │   │   │   task(input=input, output=output)                                           │
│   417 │   │   except Exception as e:                                                             │
│   418 │   │   │   logger.error(f'Failed processing {io_pair[0]}: {str(e)}')                      │
│   419 │   │   │   if ctx.meta['raise_failed']:                                                   │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/kraken/kraken.py:254 in recognizer        │
│                                                                                                  │
│   251 │   │   logger.info('Serializing as {} into {}'.format(ctx.meta['output_mode'], output))   │
│   252 │   │   if ctx.meta['output_mode'] != 'native':                                            │
│   253 │   │   │   from kraken import serialization                                               │
│ ❱ 254 │   │   │   fp.write(serialization.serialize(records=preds,                                │
│   255 │   │   │   │   │   │   │   │   │   │   │    image_name=ctx.meta['base_image'],            │
│   256 │   │   │   │   │   │   │   │   │   │   │    image_size=Image.open(ctx.meta['base_image'   │
│   257 │   │   │   │   │   │   │   │   │   │   │    writing_mode=ctx.meta['text_direction'],      │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/kraken/serialization.py:134 in serialize  │
│                                                                                                  │
│   131 │   if regions is not None:                                                                │
│   132 │   │   for id, regs in regions.items():                                                   │
│   133 │   │   │   for reg in regs:                                                               │
│ ❱ 134 │   │   │   │   region_map[idx] = (id, geom.Polygon(reg), reg)                             │
│   135 │   │   │   │   idx += 1                                                                   │
│   136 │                                                                                          │
│   137 │   # build region and line type dict                                                      │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/shapely/geometry/polygon.py:261 in        │
│ __init__                                                                                         │
│                                                                                                  │
│   258 │   │   BaseGeometry.__init__(self)                                                        │
│   259 │   │                                                                                      │
│   260 │   │   if shell is not None:                                                              │
│ ❱ 261 │   │   │   ret = geos_polygon_from_py(shell, holes)                                       │
│   262 │   │   │   if ret is not None:                                                            │
│   263 │   │   │   │   geom, n = ret                                                              │
│   264 │   │   │   │   self._set_geom(geom)                                                       │
│                                                                                                  │
│ /local/ocr-d/ocrd_all/venv/lib/python3.8/site-packages/shapely/geometry/polygon.py:539 in        │
│ geos_polygon_from_py                                                                             │
│                                                                                                  │
│   536 │   │   return geos_geom_from_py(shell)                                                    │
│   537 │                                                                                          │
│   538 │   if shell is not None:                                                                  │
│ ❱ 539 │   │   ret = geos_linearring_from_py(shell)                                               │
│   540 │   │   if ret is None:                                                                    │
│   541 │   │   │   return None                                                                    │
│   542                                                                                            │
│                                                                                                  │
│ in shapely.speedups._speedups.geos_linearring_from_py:346                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: A LinearRing must have at least 3 coordinate tuples

So, at a glance, the dummy region does seem to be the culprit (when writing back the output).

I'll try reprojecting the region coords from the hull of the lines and report back. (But then how do Escriptorium users deal with this?)
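
A rough sketch of what such a workaround could look like, assuming each line is available as a list of integer (x, y) points. The helper names `convex_hull` and `region_from_lines` are made up for this example, and the convex hull (Andrew's monotone chain) is only one possible choice of hull; this is not kraken or Escriptorium code.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[int, int]


def _cross(o: Point, a: Point, b: Point) -> int:
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Iterable[Point]) -> List[Point]:
    """Convex hull via Andrew's monotone chain; collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def region_from_lines(line_polygons: Sequence[Sequence[Point]]) -> List[Point]:
    """Replace a dummy region outline by the hull of all its line polygons."""
    return convex_hull(p for poly in line_polygons for p in poly)


# Two made-up line polygons standing in for the lines of one region.
lines = [
    [(10, 10), (100, 10), (100, 30), (10, 30)],
    [(10, 40), (120, 40), (120, 60), (10, 60)],
]
print(region_from_lines(lines))
```

If the lines themselves are degenerate, the result can still have fewer than 3 points, so the outline would need to be checked again before writing it back.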

mittagessen commented 8 months ago

That's a regression from the new container classes/serialization/everything code.

> (But then how do Escriptorium users deal with this?)

I'll fix it tomorrow.

bertsky commented 8 months ago

Note: it also happens with coordinates that syntactically do have more than 3 points, but where Shapely collapses the duplicates so that fewer than 3 remain:

    <pc:TextRegion id="eSc_textblock_7ed05754">
        <pc:Coords points="331,203 331,203 433,203 433,203"/>
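
A small sketch of how such outlines could be detected up front, assuming that repeated points are what makes the ring degenerate. The helpers `distinct_points` and `is_degenerate` are hypothetical names for this example; they do not reproduce Shapely's internal behaviour.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]


def distinct_points(points: str) -> List[Point]:
    """Parse "x,y x,y ..." and drop repeated points, keeping first occurrences."""
    seen: Set[Point] = set()
    out: List[Point] = []
    for pair in points.split():
        x, y = pair.split(',')
        pt = (int(x), int(y))
        if pt not in seen:
            seen.add(pt)
            out.append(pt)
    return out


def is_degenerate(points: str) -> bool:
    """True if fewer than 3 distinct points remain, so no ring can be built."""
    return len(distinct_points(points)) < 3


# The region outline quoted above has 4 tuples but only 2 distinct points.
print(is_degenerate("331,203 331,203 433,203 433,203"))
```

This only counts distinct points; three or more distinct but collinear points would still describe a zero-area outline and are not caught here.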

mittagessen commented 6 months ago

It's been fixed in 5.0.