ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0
3.42k stars, 591 forks

Segmentation Deletes lowercase letters #307

Closed (tushar1328 closed this issue 5 years ago)

tushar1328 commented 6 years ago

Problem

When the *.pseg.png file is generated, letters with descenders (such as y, g, and p) are missing from it. For example, in the word "Players" the lowercase "y" is missing. I tried changing --vscale, but the output file did not change. Changing the scale did not help either. Is there any other option for adjusting the scale?

Input File

0001 bin

Output File

0001 pseg

Steps to Reproduce (for bugs)

1. ./run-test

Your Environment

lehzwo commented 6 years ago

I think I found the problem, but so far I can't think of a solution that is easy to implement.

The Problem:

Your idea to change the value of the scale parameter was reasonable, since the characters are removed in the function _removehlines.

With an estimated scale of ~31, every object wider than ~310 px is deleted. Objects are detected as connected regions of black pixels.

When I took a closer look at the removed characters, I noticed that they are connected to the long horizontal lines. Because of that, each horizontal line and the characters touching it are interpreted as a single object, which is then removed by the _removehlines function because the object's width is greater than ~310 px.
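
To make the failure mode concrete, here is a minimal pure-Python sketch of a width-based removal step like the one described above. It is not ocropy's actual code: the name `remove_wide_objects`, the `maxsize` parameter, and the factor of 10 are assumptions (the factor is only inferred from the ~31 scale and ~310 px figures), and 4-connectivity is assumed for the labeling.

```python
from collections import deque


def remove_wide_objects(binary, scale, maxsize=10):
    """Return a copy of `binary` (rows of 0/1, 1 = black) in which every
    4-connected component wider than maxsize * scale pixels is erased.

    Hypothetical helper for illustration; the threshold factor is assumed.
    """
    height, width = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(width):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill one connected component of black pixels.
            component = []
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase the whole component if its bounding box is too wide.
            xs = [x for _, x in component]
            if max(xs) - min(xs) + 1 > maxsize * scale:
                for y, x in component:
                    out[y][x] = 0
    return out


# Toy page: a long rule, one stroke touching it, one stroke stopping short.
page = [[0] * 20 for _ in range(5)]
for x in range(20):
    page[3][x] = 1            # long horizontal rule (20 px wide)
for y in range(0, 3):
    page[y][5] = 1            # stroke that touches the rule, like a descender
for y in range(0, 2):
    page[y][12] = 1           # stroke that ends one row above the rule

cleaned = remove_wide_objects(page, scale=1)  # threshold: 10 px
```

In this toy page the stroke at column 5 shares a component with the rule, so it is erased along with it, while the stroke at column 12 survives. That mirrors the missing "y" in "Players".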

zuphilip commented 5 years ago

@tushar1328 What is the status of this issue?

kba commented 5 years ago

I would think that it could be solved relatively easily for this particular type of image with a preprocessing step (iterate pixel lines, if X consecutive lines are more than Y percent black, flip the first one white to disconnect letter descenders from that line). Generalizing such a heuristic would probably cause new problems and require extensive testing.
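
A rough sketch of that preprocessing idea, in pure Python, might look like the following. The function name `disconnect_rules` and the parameters `min_rows` (the "X") and `min_black` (the "Y", as a fraction rather than a percentage) are made up for illustration, and the default values are arbitrary assumptions that would need tuning per image.

```python
def disconnect_rules(binary, min_rows=2, min_black=0.8):
    """Return a copy of `binary` (rows of 0/1, 1 = black) in which the first
    row of every run of at least `min_rows` mostly-black rows is set to white.

    Hypothetical helper for illustration; thresholds are assumptions.
    """
    height = len(binary)
    out = [row[:] for row in binary]
    # Mark rows whose fraction of black pixels exceeds the threshold.
    dark = [sum(row) / len(row) > min_black for row in binary]
    y = 0
    while y < height:
        if not dark[y]:
            y += 1
            continue
        start = y
        while y < height and dark[y]:
            y += 1
        # A long enough run is treated as a horizontal rule: blank its
        # first row so strokes arriving from above no longer touch it.
        if y - start >= min_rows:
            out[start] = [0] * len(out[start])
    return out


# Toy page: a 2 px thick rule with a stroke touching it from above.
page = [[0] * 20 for _ in range(6)]
for y in (3, 4):
    page[y] = [1] * 20        # horizontal rule, two rows thick
for y in range(0, 3):
    page[y][5] = 1            # stroke ending directly on the rule

separated = disconnect_rules(page)
```

After this step the stroke at column 5 ends one white row above what remains of the rule, so a width-based removal would treat the two as separate objects. As written, the sketch only separates strokes that touch a rule from above; strokes touching it from below stay connected, which is one example of the generalization problems mentioned above.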