tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

FilterMusic has false positives in documents with tables #1255

Open akhudek opened 6 years ago

akhudek commented 6 years ago

Environment

Current Behavior:

The table header and parts of the cell contents are not recognized because critical parts of the table are removed from the binarized version of the document. [images: test-table, testb]

Expected Behavior:

If you disable FilterMusic, it works fine. [image: testb2]

Suggested Fix:

It's not clear to me what the motivation was for trying to find and identify sheet music. Is it possibly for the table detector? If so, given that the table detection is fairly poor anyway and the extraction isn't complete, I'd suggest just removing this music filter code. Alternatively, maybe expose an option to disable it?

zdenop commented 5 years ago

It is now possible to control this with the pageseg_apply_music_mask parameter, e.g. to disable the music mask with -c pageseg_apply_music_mask=0
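For anyone scripting this, here is a minimal sketch of passing the parameter from Python. It only assumes a tesseract binary on PATH that knows pageseg_apply_music_mask; the helper names and the file names table.png / table_out are hypothetical and not part of Tesseract.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, apply_music_mask=False):
    """Return the argument list for a tesseract run with the music mask toggled.

    Hypothetical helper: it only assembles the command line and does not
    check that the installed tesseract knows the parameter.
    """
    value = 1 if apply_music_mask else 0
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        f"pageseg_apply_music_mask={value}",
    ]


def run_tesseract(image_path, output_base, apply_music_mask=False):
    """Run tesseract with the assembled arguments (needs tesseract on PATH)."""
    cmd = build_tesseract_cmd(image_path, output_base, apply_music_mask)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without
    # an image file or a tesseract installation.
    print(" ".join(build_tesseract_cmd("table.png", "table_out")))
```

Setting the value explicitly keeps the result independent of whatever default the installed version uses.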

stweil commented 5 years ago

I wonder whether the current default is reasonable. If most documents give better results when this parameter is false, then that should be the default.

zdenop commented 5 years ago

IMO it will need more investigation:

bertsky commented 3 years ago

I've just come across this through a lot of painful debugging of table-rich images. I had never heard of pageseg_apply_music_mask until I found it in the code – it kind of smiled back at me and said "I'm all here!".

@stweil @zdenop IMO tables are much more important/prevalent than music sheets, and for the latter you'd probably need better/specialized software anyway. So I would argue the default should be changed. At the very least, this should be well documented and made more visible (perhaps by shipping a config file for it).
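A rough sketch of what such a config file could look like, generated from Python. Tesseract config files are plain text with one "name value" pair per line; the file name nomusicmask and both helper functions are made up for illustration, and whether a config path outside tessdata/configs is picked up is an assumption to verify against your installation.

```python
from pathlib import Path


def write_music_mask_config(path, apply_music_mask=False):
    """Write a one-line config file that sets pageseg_apply_music_mask.

    Hypothetical helper; the file layout is one "name value" pair per line.
    """
    value = 1 if apply_music_mask else 0
    config = Path(path)
    config.write_text(f"pageseg_apply_music_mask {value}\n", encoding="utf-8")
    return config


def build_cmd_with_config(image_path, output_base, config_path):
    """Config file names are listed after the output base on the command line."""
    return ["tesseract", image_path, output_base, str(config_path)]
```

With this, something like build_cmd_with_config("table.png", "table_out", write_music_mask_config("nomusicmask")) yields a command line that selects the setting by name instead of spelling out the -c option.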

BTW, the other problem with line detection (there's more of course) seems to be the proximity between separators and text in low-density images – this is even harder than the music/not-music decision, because you can easily get false positive horizontal or vertical lines in normal text. (Pragmatically you can upsample such images as a workaround, therefore perhaps the code can be made more robust by using additional filters or optimise some parameters...)

Here's an example that also shows how naive the music sheet detection is: [image: vhlinefinding_img]

bertsky commented 3 years ago

> So I would argue the default should be changed. At the very least, this should be well documented and made more visible (perhaps by shipping a config file for it).

Also, I don't think simply suppressing the staves is the correct thing to do. They should (also) be polygonalized and used as partitions during layout analysis, to be presented as (special – PT.NOISE or PT.MUSIC) regions (perhaps even with text lines in between). We could partly reuse code that tailors PT.HORZ and PT.VERT from the separator masks. But I guess that is a separate issue already.

wollmers commented 3 years ago

@bertsky Interesting. The false positive glyphs in the mask are all connected to the horizontal separators. This does not look like a binarisation problem or "overinking". It looks more like it is caused by a morphological transformation that connects glyphs horizontally and vertically too much. Maybe it is an internal parameter.

See the rectangular shape of the connections:

[image: screenshot 2021-09-24 at 20:49:13]
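To make the suspected effect concrete, here is a toy sketch of a binary morphological closing on a single pixel column. It is not Tesseract or Leptonica code: the window size of 3 and the pixel values are invented for illustration. It shows how a closing merges a separator with a glyph one pixel away, and why the upsampling workaround mentioned earlier in the thread could help if the window is fixed in pixels (which is not established here).

```python
def dilate(pixels, size):
    """Binary dilation of a 1-D pixel list with a centred window of odd length."""
    r = size // 2
    return [int(any(pixels[max(0, i - r): i + r + 1])) for i in range(len(pixels))]


def erode(pixels, size):
    """Binary erosion of a 1-D pixel list with a centred window of odd length."""
    r = size // 2
    return [int(all(pixels[max(0, i - r): i + r + 1])) for i in range(len(pixels))]


def closing(pixels, size):
    """Dilation followed by erosion; fills background gaps shorter than `size`.

    The input is padded with background so the borders behave as if the
    column continued with empty pixels on both sides.
    """
    pad = [0] * size
    padded = pad + list(pixels) + pad
    closed = erode(dilate(padded, size), size)
    return closed[size:-size]


def upsample(pixels, factor):
    """Nearest-neighbour upsampling: repeat every pixel `factor` times."""
    return [p for p in pixels for _ in range(factor)]


if __name__ == "__main__":
    # 1 = ink. A two-pixel separator, a one-pixel gap, then a three-pixel glyph.
    column = [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
    print(closing(column, 3))  # the gap is filled, separator and glyph merge
    bigger = upsample(column, 3)
    print(closing(bigger, 3) == bigger)  # the gap is now 3 pixels and survives
```

Once the separator and the glyph form one connected component, any later step that removes the separator by component would take the glyph with it, which would match the rectangular connections in the screenshot.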

stweil commented 3 years ago

> So I would argue the default should be changed.

I agree, see the comment above which I wrote two years ago. We could not do that for Tesseract 4 for compatibility reasons, but we are free to do it now for Tesseract 5.

bertsky commented 3 years ago

> I agree, see the comment above which I wrote two years ago. We could not do that for Tesseract 4 for compatibility reasons, but we are free to do it now for Tesseract 5.

Splendid. Let's do it then! Is there any documentation regarding staff detection which would need to be changed?

(Also, the additional aspects of how to provide a useful API for the music regions, and fix the above seen false positives, should be tracked in a separate issue...)

amitdo commented 3 years ago

I am not able to reproduce the issue with the provided image with the latest commit from the main branch.

stweil commented 3 years ago

@akhudek, can you still reproduce this issue?

akhudek commented 3 years ago

I passed on ownership of our tesseract work a few years ago, but I’ve copied the new owner in case he’s still tracking this.

On Oct 25, 2021 at 4:34:14 PM, Stefan Weil wrote:

> @akhudek, can you still reproduce this issue?

stweil commented 3 years ago

The comment https://github.com/tesseract-ocr/tesseract/pull/2732#issuecomment-546703111 has an example where -c pageseg_apply_music_mask=0 improved the OCR. Therefore that is now the default.

The bad news is that there seems to be a memory leak related to this default:

valgrind --leak-check=full --track-origins=yes tesseract 67636651-db58da80-f8a0-11e9-9d44-45bcff5f1609.jpg - -c pageseg_apply_music_mask=0 -l fast/script/Latin
==298386== Memcheck, a memory error detector
==298386== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==298386== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==298386== Command: tesseract 67636651-db58da80-f8a0-11e9-9d44-45bcff5f1609.jpg - -c pageseg_apply_music_mask=0 -l fast/script/Latin
==298386== 
[...]
==298376== 
==298376== HEAP SUMMARY:
==298376==     in use at exit: 79,328 bytes in 4 blocks
==298376==   total heap usage: 567,692 allocs, 567,688 frees, 498,788,111 bytes allocated
==298376== 
==298376== 39,664 (64 direct, 39,600 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 4
==298376==    at 0x483AB65: calloc (vg_replace_malloc.c:760)
==298376==    by 0x49AC486: pixCreateHeader (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49AD33E: pixCreateNoInit (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49AD587: pixCreateTemplateNoInit (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49AD65B: pixCreateTemplate (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49ADAD7: pixCopy (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49B696A: pixAnd (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x1C9800: tesseract::Image::operator&(tesseract::Image) const (image.cpp:52)
==298376==    by 0x253AA9: tesseract::LineFinder::GetLineMasks(int, tesseract::Image, tesseract::Image*, tesseract::Image*, tesseract::Image*, tesseract::Image*, tesseract::Image*, tesseract::Image*, Pixa*) (linefind.cpp:640)
==298376==    by 0x254FA2: tesseract::LineFinder::FindAndRemoveLines(int, bool, tesseract::Image, int*, int*, tesseract::Image*, tesseract::TabVector_LIST*, tesseract::TabVector_LIST*) (linefind.cpp:254)
==298376==    by 0x1619D6: tesseract::Tesseract::SetupPageSegAndDetectOrientation(tesseract::PageSegMode, tesseract::BLOCK_LIST*, tesseract::Tesseract*, tesseract::OSResults*, tesseract::TO_BLOCK_LIST*, tesseract::Image*, tesseract::Image*) (pagesegmain.cpp:288)
==298376==    by 0x162286: tesseract::Tesseract::AutoPageSeg(tesseract::PageSegMode, tesseract::BLOCK_LIST*, tesseract::TO_BLOCK_LIST*, tesseract::BLOBNBOX_LIST*, tesseract::Tesseract*, tesseract::OSResults*) (pagesegmain.cpp:209)
==298376== 
==298376== 39,664 (64 direct, 39,600 indirect) bytes in 1 blocks are definitely lost in loss record 4 of 4
==298376==    at 0x483AB65: calloc (vg_replace_malloc.c:760)
==298376==    by 0x49AC486: pixCreateHeader (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49AD33E: pixCreateNoInit (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49AD587: pixCreateTemplateNoInit (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49AD65B: pixCreateTemplate (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49ADAD7: pixCopy (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x49B696A: pixAnd (in /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4)
==298376==    by 0x1C9800: tesseract::Image::operator&(tesseract::Image) const (image.cpp:52)
==298376==    by 0x254FEC: tesseract::LineFinder::FindAndRemoveLines(int, bool, tesseract::Image, int*, int*, tesseract::Image*, tesseract::TabVector_LIST*, tesseract::TabVector_LIST*) (linefind.cpp:262)
==298376==    by 0x1619D6: tesseract::Tesseract::SetupPageSegAndDetectOrientation(tesseract::PageSegMode, tesseract::BLOCK_LIST*, tesseract::Tesseract*, tesseract::OSResults*, tesseract::TO_BLOCK_LIST*, tesseract::Image*, tesseract::Image*) (pagesegmain.cpp:288)
==298376==    by 0x162286: tesseract::Tesseract::AutoPageSeg(tesseract::PageSegMode, tesseract::BLOCK_LIST*, tesseract::TO_BLOCK_LIST*, tesseract::BLOBNBOX_LIST*, tesseract::Tesseract*, tesseract::OSResults*) (pagesegmain.cpp:209)
==298376==    by 0x162770: tesseract::Tesseract::SegmentPage(char const*, tesseract::BLOCK_LIST*, tesseract::Tesseract*, tesseract::OSResults*) (pagesegmain.cpp:140)
==298376== 
==298376== LEAK SUMMARY:
==298376==    definitely lost: 128 bytes in 2 blocks
==298376==    indirectly lost: 79,200 bytes in 2 blocks
==298376==      possibly lost: 0 bytes in 0 blocks
==298376==    still reachable: 0 bytes in 0 blocks
==298376==         suppressed: 0 bytes in 0 blocks
==298376== 
==298376== For lists of detected and suppressed errors, rerun with: -s
==298376== ERROR SUMMARY: 3672394 errors from 486 contexts (suppressed: 9697540 from 120)

stweil commented 3 years ago

The two memory leaks are fixed by commit b4e4e00653b22a6bd6fce2ae37188ef80da6f3ad.