qurator-spk / sbb_textline_detection

Detect textlines in document images
Apache License 2.0

No text lines detected - Regression? #60

Closed mikegerber closed 2 years ago

mikegerber commented 2 years ago

Using https://qurator-data.de/examples/actevedef_718448162.first-page.zip, running

ocrd-sbb-textline-detector --overwrite -I OCR-D-IMG -O OCR-D-SEG-LINE-SBB-TLD -P model "/var/lib/textline_detection"

only gives a page border, with no text regions or text lines:

        <pc:Border>
            <pc:Coords points="105,80 2418,80 2418,3952 105,3952"/>
        </pc:Border>

# pip list | egrep -i 'ocrd|sbb'
ocrd                   2.38.0
ocrd-modelfactory      2.38.0
ocrd-models            2.38.0
ocrd-utils             2.38.0
ocrd-validators        2.38.0
qurator-sbb-textline   0.0.1

I'm investigating.

mikegerber commented 2 years ago

Same problem with the non-OCR-D-CLI:

sbb_textline_detector -i OCR-D-IMG_00000024.tif -o test-out -m /home/mike/devel/qurator-data/textline_detection

mikegerber commented 2 years ago

Text regions look ok at https://github.com/qurator-spk/sbb_textline_detection/blob/master/qurator/sbb_textline_detector/main.py#L2077 but they get reset in https://github.com/qurator-spk/sbb_textline_detection/blob/master/qurator/sbb_textline_detector/main.py#L2089-L2091 - so I'm guessing the contour detection throws an exception.

mikegerber commented 2 years ago

The error "module 'cv2' has no attribute 'cv2'" is caught here:

https://github.com/qurator-spk/sbb_textline_detection/blob/eaf8ecd4d451c669bf7c765a338e7eb33163b414/qurator/sbb_textline_detector/main.py#L2088-L2091

I think the exception handling here is too broad, which is bad practice. If a specific exception is expected, it should be named in the except clause. That would have made this kind of bug easier to track down, because the failure would have produced a proper error message instead of being silently ignored.
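To illustrate the point, here is a minimal sketch, not the code from main.py. The names fake_cv2, detect_regions, regions_broad and regions_narrow are hypothetical, and fake_cv2 is only a stand-in so the example runs without OpenCV installed. Catching ValueError in the narrow variant is likewise just an assumed "expected" failure.

```python
import types

# Hypothetical stand-in for the OpenCV module, so this runs without OpenCV.
fake_cv2 = types.SimpleNamespace()


def detect_regions():
    # Mimics code that reaches for an attribute the installed library
    # does not provide; this raises AttributeError.
    return fake_cv2.cv2.findContours


def regions_broad():
    try:
        return detect_regions()
    except:  # swallows everything, so the AttributeError is never seen
        return []


def regions_narrow():
    try:
        return detect_regions()
    except ValueError:  # assumption: the only failure we expect to handle
        return []
```

With the broad handler the caller just gets an empty list, which looks like "no text regions found". With the narrow handler the AttributeError propagates and the traceback points straight at the broken attribute access.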

Downgrading opencv-python-headless fixes this. Version 4.6.x from June 2022 seems to break contour detection here, so sbb_textline_detector finds no text regions and therefore no text lines either.

I'm preparing a PR to work around the issue by requiring opencv-python-headless < 4.6.
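For anyone who wants to check an existing environment against that constraint, a rough sketch follows. The function name is_supported_opencv is hypothetical and this is not what the PR does (the PR pins the requirement). It only compares the major and minor components of a version string against 4.6.

```python
def is_supported_opencv(version: str) -> bool:
    """Return True if the version string is below 4.6 (major.minor only)."""
    # Assumes the first two dot-separated components are plain integers.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (4, 6)
```

In an environment with OpenCV installed, the string to pass in would be cv2.__version__.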

👀 @kba This - the broad exception catching and the attribute error with the newest OpenCV version - might come up in eynollah too.

mikegerber commented 2 years ago

PEP8 (https://peps.python.org/pep-0008/) also has an opinion about this:

When catching exceptions, mention specific exceptions whenever possible instead of using a bare except: clause:

try:
    import platform_specific_module
except ImportError:
    platform_specific_module = None

A bare except: clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems. If you want to catch all exceptions that signal program errors, use except Exception: (bare except is equivalent to except BaseException:).

A good rule of thumb is to limit use of bare ‘except’ clauses to two cases:

1. If the exception handler will be printing out or logging the traceback; at least the user will be aware that an error has occurred.
2. If the code needs to do some cleanup work, but then lets the exception propagate upwards with raise. try...finally can be a better way to handle this case.
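The two cases quoted above could look roughly like this. It is a sketch with hypothetical helper names (run_logged, run_with_cleanup). The first helper uses except Exception rather than a bare except, so that Control-C still works; that is my choice, not part of the quote.

```python
import logging


def run_logged(task):
    # Case 1: the handler logs the traceback, so the failure stays visible.
    try:
        return task()
    except Exception:
        logging.exception("task failed")
        return None


def run_with_cleanup(task, cleanup):
    # Case 2: do some cleanup work, then let the exception propagate.
    try:
        return task()
    except:
        cleanup()
        raise
```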

mikegerber commented 2 years ago

A bare except: clause will catch SystemExit and KeyboardInterrupt exceptions, making it harder to interrupt a program with Control-C, and can disguise other problems.

Ah that's why I always had problems interrupting the run of this program!
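A small sketch of that effect, with hypothetical function names. Raising KeyboardInterrupt by hand stands in for pressing Control-C.

```python
def swallow_all(step):
    try:
        step()
    except:  # bare except: also catches KeyboardInterrupt and SystemExit
        return "ignored"
    return "ok"


def swallow_errors_only(step):
    try:
        step()
    except Exception:  # KeyboardInterrupt is not an Exception subclass
        return "ignored"
    return "ok"


def press_ctrl_c():
    # Stand-in for the interrupt raised when the user presses Control-C.
    raise KeyboardInterrupt
```

swallow_all(press_ctrl_c) returns "ignored" and the program carries on, while swallow_errors_only(press_ctrl_c) lets the KeyboardInterrupt through so the run actually stops.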

mikegerber commented 2 years ago

There is still something broken, with https://qurator-data.de/examples/actevedef_718448162.first-page+binarization+segmentation.zip and

ocrd-sbb-textline-detector --overwrite -I OCR-D-IMG-BIN -O OCR-D-SEG-LINE-SBB-TLD -P model "/home/mike/devel/qurator-data/textline_detection/"

I get text regions, but there aren't any useful text lines (green) detected:

image

mikegerber commented 2 years ago

Thanks @vahidrezanezhad, I'll test it!

mikegerber commented 2 years ago

With opencv-python-headless == 4.5.1.48 (c4df3d6), it looks fine:

image