qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0
328 stars 26 forks source link

Possible Bug in separate_lines.py #124

Open LucPol98 opened 3 months ago

LucPol98 commented 3 months ago

Hi, I think I have spotted a bug in the separate_lines.py file, however, I would like confirmation from the creators of the code. Specifically, within this file some peaks are removed and to do so the list "clusters_to_be_deleted" is created as follows:

diff_arg_neg_must_be_deleted = np.diff(arg_neg_must_be_deleted)
arg_diff = np.array(range(len(diff_arg_neg_must_be_deleted)))
arg_diff_cluster = arg_diff[diff_arg_neg_must_be_deleted > 1]

...

if len(arg_diff_cluster) >= 2 and len(arg_diff_cluster) > 0: 
    clusters_to_be_deleted.append(
            arg_neg_must_be_deleted[0 : arg_diff_cluster[0] + 1]
        )
        for i in range(len(arg_diff_cluster) - 1):
            clusters_to_be_deleted.append(
                arg_neg_must_be_deleted[
                    arg_diff_cluster[i] + 1 : arg_diff_cluster[i + 1] + 1
                ]
            )
        clusters_to_be_deleted.append(
            arg_neg_must_be_deleted[arg_diff_cluster[len(arg_diff_cluster) - 1] + 1 :]
        )
    elif len(arg_neg_must_be_deleted) >= 2 and len(arg_diff_cluster) == 0:
        clusters_to_be_deleted.append(arg_neg_must_be_deleted[:])
    if len(arg_neg_must_be_deleted) == 1:
        clusters_to_be_deleted.append(arg_neg_must_be_deleted)

In any case, however, the first IF may have been set up wrongly in that it double checks the length of the same previously set list (arg_diff_cluster), while in the other ELIFs the length of two different lists is checked (arg_neg_must_be_deleted and arg_diff_cluster) . Wasn't it perhaps intended to set the first IF with the two different lists? Or can the first IF be reduced to only one checking of: if len(arg_diff_cluster) >= 2:

NOTICE: This portion of code is repeated twice in this file, so any corrections should be made in both of them.

Thank you :)

cneud commented 3 months ago

Thanks a lot for reporting this @LucPol98!

@vahidrezanezhad can you please have a look at the described issue and maybe also check why the code is repeated in two separate places?

LucPol98 commented 3 months ago

The code is repeated twice because it is used in two different functions and I think that is intentional. I only pointed out the fact that it was in two different places because the correction I think should be done everywhere. Maybe you could make it a common function that is called (as I did in my repo edit of yours). It is also not the only point that has excessive redundancies. For example, in almost every case where there are cv2.moments or cv2.findContours, the same pre-processing and post-processing of the result almost always happens, and even then I collected it in one function to better handle the issues. Other places in the code that process the results of the networks also repeat and make the files excessively long. Obviously mine are not critics as everyone programs as they prefer and your Repo is fantastic for what it does! I am only pointing this out to you because having more orderly and modularized code might help you in identifying problems :)

If I find any other critical issues I will let you know

vahidrezanezhad commented 3 months ago

Thanks a lot for reporting this @LucPol98!

@vahidrezanezhad can you please have a look at the described issue and maybe also check why the code is repeated in two separate places?

Sure, I'll address the described issue and investigate the code duplication.