Open wollmers opened 3 years ago
It seems rather that Tesseract cannot handle warped lines that overlap vertically.
It wrongly estimates the baseline as a nearly horizontal line instead of a polyline (or curve), then scans along that wrong baseline, losing the full height of the characters at the end of line_1_21.
In the next line the characters of the previous line get scanned.
It's a problem of segmentation into lines and also deskewing and dewarping.
If I segment the lines into image files with another tool, Tesseract gives good results (CER ~4%) with --psm 7 on the warped line. If the line is also dewarped, the result is nearly perfect (1 noisy character from a speckle, 1 I/J mismatch coming from training).
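The per-line check described above can be sketched roughly as follows. This is only an illustration, not the script used for the numbers in this comment: the helper names (`line_command`, `levenshtein`, `cer`) and the file name are made up here, CER is assumed to mean character-level edit distance divided by the length of the ground truth, and the line images are assumed to have been cut out beforehand with a separate tool.

```python
import subprocess


def line_command(image_path: str) -> list[str]:
    """Build a Tesseract call that treats the image as a single text line."""
    # "stdout" as output base makes Tesseract print the text instead of writing a file.
    return ["tesseract", image_path, "stdout", "--psm", "7"]


def ocr_line(image_path: str) -> str:
    """Run Tesseract on one pre-segmented line image (requires tesseract on PATH)."""
    result = subprocess.run(
        line_command(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(ground_truth: str, ocr_text: str) -> float:
    """Character error rate: edit distance relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)
```

With a ground-truth string for the line at hand, `cer(ground_truth, ocr_line("line_1_21.png"))` would give the error rate for that line; the file name is hypothetical.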
Indeed, the textline-finding algorithm in Tesseract can't cope with overlapping lines.
Is this a regression, or is it a bug that has existed for a long time?
I don't think this is a regression, but without testing and comparing to previous versions, I can't say with total confidence it's not a regression.
Environment
Current Behavior:
Repeats parts of preceding or following line.
Looks like some memory constructs are not cleaned.
Expected Behavior:
Each output line should contain only text from the same source line, in reading order.
Example:
Image https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.jp2
Processed with --psm 4 and variations of thresholding_method.
Diff GRT versus OCR:
https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh0.diff.txt
https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh1.diff.txt
https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh2.diff.txt
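For anyone trying to reproduce the three runs, the invocations might look roughly like the sketch below. It assumes a Tesseract build that supports the thresholding_method variable (values 0, 1 and 2, matching the thresh0 to thresh2 file names) and that can read the .jp2 input; the function name and the output base naming are guesses based on the file names above, not the exact commands used for this report.

```python
import subprocess


def threshold_commands(image_path: str, out_prefix: str) -> list[list[str]]:
    """Build one Tesseract call per thresholding_method value (0, 1, 2)."""
    commands = []
    for method in (0, 1, 2):
        commands.append([
            "tesseract",
            image_path,
            f"{out_prefix}.psm4.thresh{method}",  # output base, assumed naming
            "--psm", "4",
            "-c", f"thresholding_method={method}",
            "txt", "hocr",  # write both plain text and hOCR output
        ])
    return commands


def run_all(image_path: str, out_prefix: str) -> None:
    """Run the three variants in sequence (requires tesseract on PATH)."""
    for command in threshold_commands(image_path, out_prefix):
        subprocess.run(command, check=True)
```

Calling `run_all("isisvonoken1826oken_0137.jp2", "isisvonoken1826oken_0137")` should then leave one .txt and one .hocr file per thresholding variant, which can be diffed against the ground truth.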
As another symptom, the bounding boxes of the lines overlap vertically in the hOCR output, i.e. Tesseract calculates them incorrectly.
https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh2.hocr
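The overlap can be checked mechanically. The following is a minimal sketch that reads the bbox values from the title attribute of ocr_line elements and reports consecutive lines whose boxes intersect both vertically and horizontally. The class and function names are illustrative, only elements with class ocr_line are considered (other line classes such as ocr_caption are ignored), and a small overlap can also come from ascenders and descenders, so the result is a hint rather than proof.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxParser(HTMLParser):
    """Collect (id, x0, y0, x1, y1) for every ocr_line element in an hOCR file."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocr_line" not in (attr.get("class") or "").split():
            return
        match = BBOX_RE.search(attr.get("title") or "")
        if match:
            x0, y0, x1, y1 = map(int, match.groups())
            self.boxes.append((attr.get("id"), x0, y0, x1, y1))


def overlapping_lines(hocr_text: str) -> list[tuple]:
    """Return (id_a, id_b, overlap_px) for consecutive lines whose boxes intersect."""
    parser = LineBoxParser()
    parser.feed(hocr_text)
    overlaps = []
    for a, b in zip(parser.boxes, parser.boxes[1:]):
        id_a, ax0, ay0, ax1, ay1 = a
        id_b, bx0, by0, bx1, by1 = b
        vertical = min(ay1, by1) - max(ay0, by0)
        horizontal = min(ax1, bx1) - max(ax0, bx0)
        if vertical > 0 and horizontal > 0:
            overlaps.append((id_a, id_b, vertical))
    return overlaps
```

Reading the .hocr file into a string and passing it to `overlapping_lines` lists the affected line pairs together with the height of the overlap in pixels.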