ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Page segmentation deletes some parts of the text, how to avoid? #259

Open franvillamil opened 6 years ago

franvillamil commented 6 years ago

I am using ocropy to extract data from old documents that list electoral results. These are big pages arranged in up to 4 columns, and I have taken a screenshot of each column. See a sample of the raw image (I know the quality is very bad, but it is basically either running OCR on this or copying it entirely by hand): https://user-images.githubusercontent.com/3774527/32782400-9111bab6-c948-11e7-9ea6-6266cc828627.png

To avoid problems, I'm trying to make ocropy read the text as one-column text (see issue #240), after deleting every black line, and so far it's more or less going well. In some cases, however, ocropy deletes some parts (mainly numbers) when it does the segmentation. See below two screenshots of the original binary file and the segmentation output (the .nrm.png files):

Missing some part of '207', original binary: ocropy1a Image after segmentation: ocropy1a

The '3' is completely removed, original binary: ocropy2a After segmentation: ocropy2b

In some cases (e.g. when there are a few 1s, see below), it seems ocropy thinks these are black lines delimiting columns and tries to ignore them. But I don't want it to do this, as I'm already removing in Gimp every black line that could be mistaken for a separator.

Several 1s together might appear like a black line?: ocropy3

Solution?

Does anyone know if there is any piece of code I can modify to avoid this? I've been looking into the gpageseg code but haven't found anything.


lehzwo commented 6 years ago

Hey @franvillamil,

could you please provide a whole example page? It is hard to reproduce the error with an image of a single column.

urhub commented 6 years ago

Several 1s together might appear like a black line?

It probably could. I know that a "black line separator" needs a height of at least 20*scale (or 20*xheight) to be recognized as delimiting a column; please see issue #250.
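To make the 20*scale rule concrete, here is a minimal sketch in plain Python. It is not ocropy's code: the helper names, the toy page and the scale value are all made up for illustration, and it only measures raw vertical runs of black pixels.

```python
def longest_vertical_run(binary, col):
    """Length of the longest unbroken run of black (1) pixels in one column."""
    best = run = 0
    for row in binary:
        run = run + 1 if row[col] else 0
        best = max(best, run)
    return best


def could_be_column_separator(binary, col, scale, factor=20):
    """True if the column holds a black run at least factor*scale pixels tall."""
    return longest_vertical_run(binary, col) >= factor * scale


# Toy page: 100 rows, 3 columns. Column 0 holds a full-height rule,
# column 2 holds stacked "1"-like strokes of 12 px each, separated by gaps.
page = [[1, 0, 1 if (r % 20) < 12 else 0] for r in range(100)]
scale = 4  # hypothetical text scale in pixels, so the threshold is 80 px

rule_is_separator = could_be_column_separator(page, 0, scale)
ones_are_separator = could_be_column_separator(page, 2, scale)
print(rule_is_separator, ones_are_separator)
```

In this toy case the stacked strokes stay well below the threshold because the gaps break the run. If the real detector thickens or merges strokes before measuring (an assumption here, I have not verified it against these images), closely stacked 1s could end up much taller than they look.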

zuphilip commented 6 years ago

During page segmentation there is also a step which deletes small components (remove_noise) and another one which deletes horizontal lines (remove_hlines). Moreover, it is possible that some characters interfere with the column segmentation itself.

However, as @lehzwo already mentioned, I also cannot reproduce your issues with the images you provided, and I don't know whether you used any special parameters in the call.

@franvillamil if you provide more information so that we can reproduce the issue, we can look at this again; otherwise I suggest closing this issue.
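For anyone trying to picture how cleanup steps like remove_noise and remove_hlines could eat a digit, here is a rough, self-contained sketch. It is not the ocropy implementation: the `components` and `classify` helpers, the pixel-count threshold of 8 and the width factor of 10 are illustrative assumptions, and reading remove_hlines as "drop very wide components" is inferred from its name.

```python
from collections import deque


def components(binary):
    """Return size info for every 4-connected black component, in scan order."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            y0 = y1 = y
            x0 = x1 = x
            pixels = 0
            while queue:
                cy, cx = queue.popleft()
                pixels += 1
                y0, y1 = min(y0, cy), max(y1, cy)
                x0, x1 = min(x0, cx), max(x1, cx)
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            found.append({"height": y1 - y0 + 1, "width": x1 - x0 + 1, "pixels": pixels})
    return found


def classify(component, scale, min_pixels=8, max_width_factor=10):
    """Decide the fate of one component under the assumed thresholds."""
    if component["pixels"] < min_pixels:
        return "dropped as noise"
    if component["width"] > max_width_factor * scale:
        return "dropped as horizontal line"
    return "kept"


# Toy image: a 2x2 speck, a 50 px wide rule, and a 5x3 glyph-like blob.
image = [[0] * 60 for _ in range(12)]
for y in range(0, 2):
    for x in range(0, 2):
        image[y][x] = 1
for x in range(5, 55):
    image[4][x] = 1
for y in range(7, 12):
    for x in range(10, 13):
        image[y][x] = 1

results = [classify(c, scale=4) for c in components(image)]
print(results)
```

The point of the sketch is that both filters depend on size thresholds, so a thin or broken digit in a low-quality scan can fall on the wrong side of one of them.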

ChillarAnand commented 6 years ago

I also faced this issue. Here are some sample images.

Commands used to process the image

pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-nlbin /home/chillaranand/projects/ocr/data/vishadam-021.png -o output -n 
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-gpageseg output/????.bin.png -n 
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-rpred -Q 4 -m /home/chillaranand/projects/ocr/ocropy/models/te.pyrnn.gz output/????.bin.png -n

zuphilip commented 6 years ago

@ChillarAnand Okay, it looks like the parts below the baseline (descenders?) in your examples are larger than expected. The computed lines then look like this:

_lineseeds

and lines from the descender part are then neglected. Try to adjust the vscale/scale manually, e.g.

./ocropus-gpageseg temp/chillar0001.bin.png --debug -n --vscale 1.5

which should work well, except for the page number on the top left (but I think this is a known issue).
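If one value does not fit, it may help to try a few --vscale settings side by side. The following is only a sketch that builds the command lines, using the flags from the call above; the `gpageseg_commands` helper and the list of values are made up for illustration, and nothing is executed.

```python
import shlex


def gpageseg_commands(image, vscales, script="./ocropus-gpageseg"):
    """Build one argv list per --vscale value; nothing is run here."""
    return [[script, image, "--debug", "-n", "--vscale", str(v)] for v in vscales]


commands = gpageseg_commands("temp/chillar0001.bin.png", [1.0, 1.5, 2.0])
for argv in commands:
    # Print a shell-safe version of each command for copy and paste.
    print(" ".join(shlex.quote(part) for part in argv))
```

Each list could be passed to `subprocess.run(argv, cwd=some_dir)` with a separate working directory per value, so that the debug images of one run are not overwritten by the next.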

ChillarAnand commented 6 years ago

Thank you @zuphilip. With --vscale it is segmenting correctly. How did you generate the image above?

zuphilip commented 6 years ago

How did you generate the image above?

The --debug option produces such pictures in the folder where you are calling the script.

Shreeshrii commented 4 years ago

03

I tried the above image with different --vscale values. With --vscale 2.0 it gets all the text but puts 2 lines in each image (i.e. it produces 13 images instead of 26). Without it, it gets only 2 lines from the whole page.

ocropus-gpageseg 'book/????.bin.png' --debug -n --vscale 2.0 --maxcolseps 0 --maxseps 0
INFO:
INFO:  ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png --debug -n
INFO:
INFO:  book/0001.bin.png
INFO:  scale 57.236352
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 0 whitespace column separators
INFO:  debug _1thresh.png
INFO:  debug _2grad.png
INFO:  debug _3seps.png
INFO:  debug _4seps.png
INFO:  debug _colwsseps.png
INFO:  computing lines
INFO:  debug _cleaned.png
INFO:  debug _lineseeds.png
INFO:  debug _seeds.png
INFO:  propagating labels
INFO:  spreading labels
INFO:  number of lines 14
INFO:  finding reading order
INFO:  writing lines
INFO:      12  book/0001.bin.png 57.2 13

ocropus-gpageseg 'book/????.bin.png'
INFO:
INFO:  ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png
INFO:
INFO:  book/0001.bin.png
INFO:  scale 57.236352
INFO:  computing segmentation
INFO:  computing column separators
INFO:  considering at most 3 whitespace column separators
INFO:  computing lines
INFO:  propagating labels
INFO:  spreading labels
INFO:  number of lines 54
INFO:  finding reading order
INFO:  writing lines
INFO:       1  book/0001.bin.png 57.2 2

_lineseeds.png seems to be identifying all the lines.

_lineseeds
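To compare runs like the two above without reading the logs by eye, a small parser can pull out the "number of lines" value and the count on the final summary line. This is a sketch under assumptions: the `summarize` helper is hypothetical, and I am reading the last field of the summary line as the number of line images written, which matches the 13 and 2 reported above but is not confirmed from the ocropy source.

```python
import re

LINES_FOUND = re.compile(r"INFO:\s+number of lines (\d+)")
# Assumed layout of the final summary line: index, file name, scale, lines written.
SUMMARY = re.compile(r"INFO:\s+(\d+)\s+(\S+)\s+([\d.]+)\s+(\d+)\s*$")


def summarize(log_text):
    """Extract the reported and the written line counts from one gpageseg log."""
    found = written = None
    for line in log_text.splitlines():
        match = LINES_FOUND.match(line)
        if match:
            found = int(match.group(1))
            continue
        match = SUMMARY.match(line)
        if match:
            written = int(match.group(4))
    return {"found": found, "written": written}


with_vscale = """INFO:  scale 57.236352
INFO:  number of lines 14
INFO:  writing lines
INFO:      12  book/0001.bin.png 57.2 13"""

default_run = """INFO:  scale 57.236352
INFO:  number of lines 54
INFO:  writing lines
INFO:       1  book/0001.bin.png 57.2 2"""

print(summarize(with_vscale))
print(summarize(default_run))
```

In the default run the log reports 54 lines but only 2 are written, which suggests that most candidate lines are discarded somewhere between labelling and writing.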