mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Recognition aborts at "baselines" which are only a point #606

Closed stweil closed 6 months ago

stweil commented 6 months ago

There was a report in the eScriptorium Gitter chat about recognition failing for a certain image. With the provided export (export_doc1_consular_cards_1_alto_202405140257.zip), the issue can be reproduced not only in eScriptorium but also with the latest kraken on the command line.

I modified kraken.py to get a full exception backtrace and found that this part of the ALTO XML triggers the exception:

          <TextLine ID="eSc_line_fd882965"
                    BASELINE="193 1557 193 1557" 
                    HPOS="190"
                    VPOS="1523"
                    WIDTH="3"
                    HEIGHT="46">
           <Shape><Polygon POINTS="192 1555 190 1525 193 1523 193 1569 192 1569"/></Shape>
           <String CONTENT=""
                    HPOS="190"
                    VPOS="1523"
                    WIDTH="3"
                    HEIGHT="46"></String>
          </TextLine>

Normally kraken would process lots of lines before reaching that fatal line, but when I move that line to the first position, the exception is raised early:

% kraken -f alto -i 00001-00020.pdf_page_12.xml text -vvvv ocr -m german_handwriting.mlmodel
scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.1.2 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
[05/14/24 13:54:13] INFO     Loading model from /Users/stweil/Library/Application Support/kraken/german_handwriting.mlmodel                           models.py:209
[05/14/24 13:54:17] DEBUG    layer           type    params                                                                                             vgsl.py:171
                    DEBUG    0               conv    kernel 3 x 13 filters 32 stride (1, 1) dilation (1, 1) activation r                                vgsl.py:641
                    DEBUG    1               dropout probability 0.1 dims 2                                                                             vgsl.py:521
                    DEBUG    2               maxpool kernel 2 x 2 stride 2 x 2                                                                          vgsl.py:663
                    DEBUG    3               conv    kernel 3 x 13 filters 32 stride (1, 1) dilation (1, 1) activation r                                vgsl.py:641
                    DEBUG    4               dropout probability 0.1 dims 2                                                                             vgsl.py:521
                    DEBUG    5               maxpool kernel 2 x 2 stride 2 x 2                                                                          vgsl.py:663
                    DEBUG    6               conv    kernel 3 x 9 filters 64 stride (1, 1) dilation (1, 1) activation r                                 vgsl.py:641
                    DEBUG    7               dropout probability 0.1 dims 2                                                                             vgsl.py:521
                    DEBUG    8               maxpool kernel 2 x 2 stride 2 x 2                                                                          vgsl.py:663
                    DEBUG    9               conv    kernel 3 x 9 filters 64 stride (1, 1) dilation (1, 1) activation r                                 vgsl.py:641
                    DEBUG    10              dropout probability 0.1 dims 2                                                                             vgsl.py:521
                    DEBUG    11              reshape from 1 1 x -1 to 1/3                                                                               vgsl.py:699
                    DEBUG    12              rnn     direction b transposed False summarize False out 200 legacy None                                   vgsl.py:503
                    DEBUG    13              dropout probability 0.1 dims 2                                                                             vgsl.py:521
                    DEBUG    14              rnn     direction b transposed False summarize False out 200 legacy None                                   vgsl.py:503
                    DEBUG    15              dropout probability 0.5 dims 1                                                                             vgsl.py:521
                    DEBUG    16              rnn     direction b transposed False summarize False out 200 legacy None                                   vgsl.py:503
                    DEBUG    17              dropout probability 0.5 dims 1                                                                             vgsl.py:521
                    DEBUG    18              linear  augmented False out 295                                                                            vgsl.py:743
                    DEBUG    Deserializing layer  with type <class 'kraken.lib.layers.MultiParamSequential'>                                            vgsl.py:291
                    DEBUG    Deserializing layer C_0 with type <class 'kraken.lib.layers.ActConv2D'>                                                    vgsl.py:291
                    DEBUG    Deserializing layer Do_1 with type <class 'kraken.lib.layers.Dropout'>                                                     vgsl.py:291
[05/14/24 13:54:18] DEBUG    Deserializing layer Mp_2 with type <class 'kraken.lib.layers.MaxPool'>                                                     vgsl.py:291
                    DEBUG    Deserializing layer C_3 with type <class 'kraken.lib.layers.ActConv2D'>                                                    vgsl.py:291
                    DEBUG    Deserializing layer Do_4 with type <class 'kraken.lib.layers.Dropout'>                                                     vgsl.py:291
[05/14/24 13:54:19] DEBUG    Deserializing layer Mp_5 with type <class 'kraken.lib.layers.MaxPool'>                                                     vgsl.py:291
                    DEBUG    Deserializing layer C_6 with type <class 'kraken.lib.layers.ActConv2D'>                                                    vgsl.py:291
[05/14/24 13:54:20] DEBUG    Deserializing layer Do_7 with type <class 'kraken.lib.layers.Dropout'>                                                     vgsl.py:291
                    DEBUG    Deserializing layer Mp_8 with type <class 'kraken.lib.layers.MaxPool'>                                                     vgsl.py:291
                    DEBUG    Deserializing layer C_9 with type <class 'kraken.lib.layers.ActConv2D'>                                                    vgsl.py:291
[05/14/24 13:54:21] DEBUG    Deserializing layer Do_10 with type <class 'kraken.lib.layers.Dropout'>                                                    vgsl.py:291
                    DEBUG    Deserializing layer S_11 with type <class 'kraken.lib.layers.Reshape'>                                                     vgsl.py:291
[05/14/24 13:54:22] DEBUG    Deserializing layer L_12 with type <class 'kraken.lib.layers.TransposedSummarizingRNN'>                                    vgsl.py:291
                    DEBUG    Deserializing layer Do_13 with type <class 'kraken.lib.layers.Dropout'>                                                    vgsl.py:291
[05/14/24 13:54:23] DEBUG    Deserializing layer L_14 with type <class 'kraken.lib.layers.TransposedSummarizingRNN'>                                    vgsl.py:291
                    DEBUG    Deserializing layer Do_15 with type <class 'kraken.lib.layers.Dropout'>                                                    vgsl.py:291
[05/14/24 13:54:24] DEBUG    Deserializing layer L_16 with type <class 'kraken.lib.layers.TransposedSummarizingRNN'>                                    vgsl.py:291
                    DEBUG    Deserializing layer Do_17 with type <class 'kraken.lib.layers.Dropout'>                                                    vgsl.py:291
[05/14/24 13:54:25] DEBUG    Deserializing layer O_18 with type <class 'kraken.lib.layers.LinSoftmax'>                                                  vgsl.py:291
[05/14/24 13:54:26] INFO     TextLine eSc_line_451906b0 without polygon                                                                                  xml.py:191
                    INFO     Running 1 multi-script recognizers on                                                                                     rpred.py:130
                             /Users/stweil/src/github/mittagessen/kraken/export_doc1_consular_cards_1_alto_202405140257/00001-00020.pdf_page_12.png                
                             with 154 lines                                                                                                                        
                    DEBUG    Loading line transforms for ('type', 'default')                                                                           rpred.py:147

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/stweil/src/github/mittagessen/venv3.9/bin/kraken:8 in <module>                            │
│                                                                                                  │
│   5 from kraken.kraken import cli                                                                │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/click/core.py:1157 in   │
│ __call__                                                                                         │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/click/core.py:1078 in   │
│ main                                                                                             │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/click/core.py:1720 in   │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/click/core.py:1657 in   │
│ _process_result                                                                                  │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/click/core.py:783 in    │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/kraken/kraken.py:430 in │
│ process_pipeline                                                                                 │
│                                                                                                  │
│   427 │   │   │   │   if len(fc) - 2 == idx:                                                     │
│   428 │   │   │   │   │   ctx.meta['last_process'] = True                                        │
│   429 │   │   │   │   with threadpool_limits(limits=ctx.meta['threads']):                        │
│ ❱ 430 │   │   │   │   │   task(input=input, output=output)                                       │
│   431 │   │   # except Exception as e:                                                           │
│   432 │   │   #    logger.error(f'Failed processing {io_pair[0]}: {str(e)}')                     │
│   433 │   │   #    if ctx.meta['raise_failed']:                                                  │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/kraken/kraken.py:242 in │
│ recognizer                                                                                       │
│                                                                                                  │
│   239 │                                                                                          │
│   240 │   with KrakenProgressBar() as progress:                                                  │
│   241 │   │   pred_task = progress.add_task('Processing', total=len(it), visible=True if not c   │
│ ❱ 242 │   │   for pred in it:                                                                    │
│   243 │   │   │   preds.append(pred)                                                             │
│   244 │   │   │   progress.update(pred_task, advance=1)                                          │
│   245 │   results = dataclasses.replace(it.bounds, lines=preds, imagename=ctx.meta['base_image   │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/kraken/rpred.py:300 in  │
│ __next__                                                                                         │
│                                                                                                  │
│   297 │   │   │   return rec.display_order(None)                                                 │
│   298 │                                                                                          │
│   299 │   def __next__(self):                                                                    │
│ ❱ 300 │   │   return self.next_iter(next(self.line_iter))                                        │
│   301 │                                                                                          │
│   302 │   def __iter__(self):                                                                    │
│   303 │   │   return self                                                                        │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/kraken/rpred.py:255 in  │
│ _recognize_baseline_line                                                                         │
│                                                                                                  │
│   252 │   │   use_legacy_polygons = self._choose_legacy_polygon_extractor(net)                   │
│   253 │   │                                                                                      │
│   254 │   │   try:                                                                               │
│ ❱ 255 │   │   │   box, coords = next(extract_polygons(self.im, seg, legacy=use_legacy_polygons   │
│   256 │   │   except KrakenInputException as e:                                                  │
│   257 │   │   │   logger.warning(f'Extracting line failed: {e}')                                 │
│   258 │   │   │   return BaselineOCRRecord('', [], [], line)                                     │
│                                                                                                  │
│ /Users/stweil/src/github/mittagessen/venv3.9/lib/python3.9/site-packages/kraken/lib/segmentation │
│ .py:1236 in extract_polygons                                                                     │
│                                                                                                  │
│   1233 │   │   │   │   │   control_pts = []                                                      │
│   1234 │   │   │   │   │   for point in pl.geoms:                                                │
│   1235 │   │   │   │   │   │   npoint = np.array(point.coords)[0]                                │
│ ❱ 1236 │   │   │   │   │   │   line_idx, dist, intercept = min(((idx, line.project(point),       │
│   1237 │   │   │   │   │   │   │   │   │   │   │   │   │   │   np.array(line.interpolate(line.p  │
│   1238 │   │   │   │   │   │   │   │   │   │   │   │   │   │   key=lambda x: np.linalg.norm(npo  │
│   1239 │   │   │   │   │   │   # absolute distance from start of line                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: min() arg is an empty sequence

stweil commented 6 months ago

If I remove all lines which have a WIDTH of 1, 2 or 3, the recognition works for the remaining lines without an exception. There are also some lines with a WIDTH of 0, but those don't cause an exception.

mittagessen commented 6 months ago

The line is invalid and should be skipped in the recognizer, but this case isn't caught. BASELINE="193 1557 193 1557" is only a point, so it can't be processed. I'll push a patch later today.

BTW WIDTH is completely ignored by the line extractor. The baseline and boundary are the important bits.
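
For illustration, the "baseline is only a point" condition could be detected along these lines. This is a minimal sketch, not the patch that went into kraken: the helper names `parse_baseline`, `baseline_length` and `is_degenerate` are hypothetical, and it assumes the BASELINE attribute is a whitespace- or comma-separated list of x y pairs, as in the ALTO snippet above.

```python
from math import hypot


def parse_baseline(value):
    """Parse a BASELINE attribute such as "193 1557 193 1557" into (x, y) tuples."""
    numbers = [float(tok) for tok in value.replace(',', ' ').split()]
    if len(numbers) % 2:
        raise ValueError(f'odd number of coordinates in {value!r}')
    return list(zip(numbers[0::2], numbers[1::2]))


def baseline_length(points):
    """Sum of the Euclidean lengths of all segments of the polyline."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def is_degenerate(value):
    """True if the baseline has fewer than two points or zero total length."""
    points = parse_baseline(value)
    return len(points) < 2 or baseline_length(points) == 0.0
```

With the attribute from this report, `is_degenerate("193 1557 193 1557")` is true because both points coincide. The check looks only at the baseline and never at WIDTH, in line with the remark above.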

mittagessen commented 6 months ago

Where do these lines come from anyway? The segmenter filters out extremely short line segments like these and IIRC the eScriptorium UI would make drawing point-sized line segments very difficult.

stweil commented 6 months ago

I think the user created those lines accidentally by manually clicking in the eScriptorium panel where it's possible to add, change or delete baselines. Maybe it's sufficient to click without drawing, and that will add a "baseline" point.

stweil commented 6 months ago

I can confirm that the recognition works if I only remove the two lines where the baseline is a point from the ALTO file.
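
As a workaround until a fix is available, such lines could be stripped from the ALTO file before recognition. The following is a rough sketch using only the Python standard library. The function names `is_point_baseline` and `drop_point_baselines` are made up for this example, and it assumes that TextLine elements carry the baseline in a BASELINE attribute as in the snippet above. Lines without a BASELINE attribute are left untouched.

```python
import xml.etree.ElementTree as ET


def is_point_baseline(value):
    """True if the BASELINE attribute holds fewer than two distinct points."""
    nums = [float(tok) for tok in value.replace(',', ' ').split()]
    return len(set(zip(nums[0::2], nums[1::2]))) < 2


def drop_point_baselines(root):
    """Remove TextLine elements whose baseline is a single point.

    Returns the IDs of the removed lines.
    """
    removed = []
    # Materialize the element list first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            # Compare the local name only, so any ALTO namespace version matches.
            if child.tag.rsplit('}', 1)[-1] != 'TextLine':
                continue
            baseline = child.get('BASELINE')
            if baseline is not None and is_point_baseline(baseline):
                parent.remove(child)
                removed.append(child.get('ID'))
    return removed
```

One would load the file with `ET.parse(...)`, call `drop_point_baselines(tree.getroot())` and write the tree back. Note that ElementTree may rewrite namespace prefixes on output unless they are registered with `ET.register_namespace` beforehand.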

stweil commented 6 months ago

... and I was able to add a baseline with zero length. I could not create it directly, but it is possible to change an existing baseline with two points so that both points are at the same position.