Closed stweil closed 6 months ago
If I remove all lines which have a WIDTH of 1, 2 or 3, the recognition works for the remaining lines without an exception. There are also some lines with a WIDTH of 0, but those don't cause an exception.
The line is invalid and should be skipped in the recognizer, but this case isn't caught. BASELINE="193 1557 193 1557" is only a point, so it can't be processed. I'll push a patch later today.
BTW, WIDTH is completely ignored by the line extractor. The baseline and boundary are the important bits.
Where do these lines come from anyway? The segmenter filters out extremely short line segments like these and IIRC the eScriptorium UI would make drawing point-sized line segments very difficult.
I think the user created those lines accidentally by manually clicking in the eScriptorium panel where it's possible to add, change or delete baselines. Maybe it's sufficient to click without drawing, and that will add a "baseline" point.
I can confirm that the recognition works if I only remove the two lines where the baseline is a point from the ALTO file.
... and I was able to add a baseline with zero length. I could not create it directly, but it is possible to change an existing baseline with two points so that both points are at the same position.
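For anyone who wants to check an export for such lines before running recognition, here is a minimal sketch that lists the TextLine elements whose baseline collapses to a single point. It is not part of kraken or eScriptorium; the function names are made up for illustration, and it assumes that BASELINE is written as a flat list of x/y coordinates, as in the example above.

```python
import xml.etree.ElementTree as ET


def parse_baseline(value):
    """Turn a BASELINE string like "193 1557 193 1557" into (x, y) tuples.

    Assumption: the attribute is a flat list of alternating x and y values.
    """
    nums = [float(tok) for tok in value.replace(",", " ").split()]
    return list(zip(nums[0::2], nums[1::2]))


def is_degenerate(points):
    """A usable baseline needs at least two distinct points."""
    return len(set(points)) < 2


def find_degenerate_lines(alto_xml):
    """Return the IDs of all TextLine elements with a point-sized baseline."""
    root = ET.fromstring(alto_xml)
    bad_ids = []
    for elem in root.iter():
        # Compare the local tag name only, so any ALTO namespace version works.
        if elem.tag.rsplit("}", 1)[-1] != "TextLine":
            continue
        baseline = elem.get("BASELINE")
        if baseline is None:
            continue
        if is_degenerate(parse_baseline(baseline)):
            bad_ids.append(elem.get("ID"))
    return bad_ids
```

Calling find_degenerate_lines() on the contents of an ALTO file returns the IDs of the lines that would have to be removed or redrawn. The check only looks at the baseline, since WIDTH is ignored by the line extractor anyway.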
There was a report in the eScriptorium Gitter chat about a failing recognition with a certain image. With the provided export (export_doc1_consular_cards_1_alto_202405140257.zip) it is possible to reproduce the issue not only in eScriptorium, but also with the latest kraken on the command line.
I modified kraken.py to get a full exception backtrace and found that this part of the ALTO XML triggers the exception:
Normally kraken would process lots of lines before handling that fatal line, but when I move that line to the first position, the exception is raised early: