Open franvillamil opened 6 years ago
Hey @franvillamil,
could you please provide a whole example page? It is hard to reproduce the error with an image of a single column.
Several 1s together might appear like a black line? They probably could. As far as I know, a "black line separator" needs a height of 20*scale or 20*xheight to be recognized as delimiting a column; please see issue #250.
During page segmentation there is also a step which deletes small components (remove_noise) and another one which deletes horizontal lines (remove_hlines). Moreover, it is possible that some characters interfere with the column segmentation itself.
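To make the three filters mentioned above more concrete, here is a minimal pure-Python sketch of how blobs could be sorted into noise, horizontal rules, column separators and text. It is not ocropy's code: the names components and classify, and the default thresholds min_pixels=8 and max_width_factor=10, are assumptions chosen for illustration; only the 20*scale height comes from this comment.

```python
from collections import deque


def components(img):
    """Return (top, left, bottom, right, pixel_count) for each 4-connected blob.

    img is a list of rows holding 0/1 values; bounds are inclusive.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right, count = y, x, y, x, 0
            while queue:  # breadth-first flood fill over one blob
                cy, cx = queue.popleft()
                count += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append((top, left, bottom, right, count))
    return blobs


def classify(img, scale, min_pixels=8, max_width_factor=10, min_sep_height_factor=20):
    """Label each blob 'noise', 'hline', 'separator' or 'text' (illustrative thresholds)."""
    labels = []
    for top, left, bottom, right, count in components(img):
        height = bottom - top + 1
        width = right - left + 1
        if count < min_pixels:
            labels.append("noise")        # tiny speck, would be dropped
        elif width > max_width_factor * scale:
            labels.append("hline")        # very wide blob, treated as a rule
        elif height >= min_sep_height_factor * scale:
            labels.append("separator")    # tall blob, could pass as a column line
        else:
            labels.append("text")
    return labels


def demo_image():
    """Build a 25x20 test page: tall bar, short stroke, speck, wide rule."""
    img = [[0] * 20 for _ in range(25)]
    for y in range(22):        # tall bar, 22 px high
        img[y][0] = 1
    for y in range(10):        # short stroke, 10 px high
        img[y][3] = 1
    img[0][6] = 1              # single-pixel speck
    for x in range(5, 17):     # wide rule, 12 px wide
        img[24][x] = 1
    return img


if __name__ == "__main__":
    print(classify(demo_image(), scale=1))
```

With scale=1 the demo labels the tall bar as a separator, the short stroke as text, the speck as noise and the wide rule as an hline. In this simplified picture, a stack of 1s would only be at risk if the digits merge into one blob that is at least 20*scale high.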
However, as @lehzwo already mentioned, I also cannot reproduce your issue with the images you provided, and I don't know whether you pass any special parameters in the call.
@franvillamil, if you provide more information so that we can reproduce the issue, we can look at this again; otherwise I suggest closing this issue.
I also faced this issue. Here are some sample images.
Commands used to process the image
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-nlbin /home/chillaranand/projects/ocr/data/vishadam-021.png -o output -n
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-gpageseg output/????.bin.png -n
pyflash.utils - INFO - python2 /home/chillaranand/projects/ocr/ocropy/ocropus-rpred -Q 4 -m /home/chillaranand/projects/ocr/ocropy/models/te.pyrnn.gz output/????.bin.png -n
@ChillarAnand Okay, it looks like in your examples the parts below the baseline (descenders?) are larger than expected. The computed lines then look like this:
and lines from the descender part are then neglected. Try to adjust the vscale/scale manually, e.g.
./ocropus-gpageseg temp/chillar0001.bin.png --debug -n --vscale 1.5
which should work well, except for the page number on the top left (but I think this is a known issue).
Thank you @zuphilip. With --vscale it is segmenting correctly. How did you generate the image above?
How did you generate the image above?
The --debug option produces such pictures in the folder from which you call the script.
I tried the above image with different --vscale values. With --vscale 2.0 it gets all the text but puts 2 lines in each image (i.e. it gets 13 images instead of 26). Without it, it gets only 2 lines in the whole page.
ocropus-gpageseg 'book/????.bin.png' --debug -n --vscale 2.0 --maxcolseps 0 --maxseps 0
INFO:
INFO: ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png --debug -n
INFO:
INFO: book/0001.bin.png
INFO: scale 57.236352
INFO: computing segmentation
INFO: computing column separators
INFO: considering at most 0 whitespace column separators
INFO: debug _1thresh.png
INFO: debug _2grad.png
INFO: debug _3seps.png
INFO: debug _4seps.png
INFO: debug _colwsseps.png
INFO: computing lines
INFO: debug _cleaned.png
INFO: debug _lineseeds.png
INFO: debug _seeds.png
INFO: propagating labels
INFO: spreading labels
INFO: number of lines 14
INFO: finding reading order
INFO: writing lines
INFO: 12 book/0001.bin.png 57.2 13
ocropus-gpageseg 'book/????.bin.png'
INFO:
INFO: ########## /usr/local/bin/ocropus-gpageseg book/????.bin.png
INFO:
INFO: book/0001.bin.png
INFO: scale 57.236352
INFO: computing segmentation
INFO: computing column separators
INFO: considering at most 3 whitespace column separators
INFO: computing lines
INFO: propagating labels
INFO: spreading labels
INFO: number of lines 54
INFO: finding reading order
INFO: writing lines
INFO: 1 book/0001.bin.png 57.2 2
_lineseeds.png seems to be identifying all the lines.
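For comparing runs like the two above, a small helper can pull the counts out of the INFO output. This is only a sketch: the function names gpageseg_command and summarize are made up here, the command is built but never executed, and the reading of the last field of the summary line as "line images written" is an assumption based on the 13 and 2 reported in this comment.

```python
import re

# Summary line as seen in the logs above: "<index> <page> <scale> <count>"
SUMMARY = re.compile(r"^(\d+)\s+(\S+)\s+([\d.]+)\s+(\d+)$")
FOUND = re.compile(r"^number of lines (\d+)$")


def gpageseg_command(pattern, vscale, maxcolseps=0, maxseps=0):
    """Build an argv list for one run, using only flags quoted in this thread."""
    return [
        "ocropus-gpageseg", pattern, "--debug", "-n",
        "--vscale", str(vscale),
        "--maxcolseps", str(maxcolseps),
        "--maxseps", str(maxseps),
    ]


def summarize(log_text):
    """Return {page: (lines_found, lines_written)} parsed from INFO output."""
    found = None
    result = {}
    for raw in log_text.splitlines():
        msg = raw.strip()
        if msg.startswith("INFO:"):
            msg = msg[len("INFO:"):].strip()
        match = FOUND.match(msg)
        if match:
            found = int(match.group(1))   # count logged before "writing lines"
            continue
        match = SUMMARY.match(msg)
        if match:
            result[match.group(2)] = (found, int(match.group(4)))
            found = None
    return result


if __name__ == "__main__":
    sample = (
        "INFO: scale 57.236352\n"
        "INFO: number of lines 54\n"
        "INFO: writing lines\n"
        "INFO: 1 book/0001.bin.png 57.2 2\n"
    )
    print(gpageseg_command("book/????.bin.png", 2.0))
    print(summarize(sample))
```

Applied to the second log, this shows 54 lines found but only 2 in the summary line, which suggests that most candidates are dropped somewhere between line finding and writing rather than missed entirely.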
I am using ocropy to extract data from old documents that list electoral results. These are big pages arranged in up to 4 columns, from which I have taken screenshots of each column. See a sample of the raw image (I know the quality is very bad, but it is basically either running OCR on this or copying it entirely by hand): https://user-images.githubusercontent.com/3774527/32782400-9111bab6-c948-11e7-9ea6-6266cc828627.png
To avoid problems, I'm trying to make ocropy read the text as one-column text (see issue #240), after deleting every black line, and so far it's more or less going well. In some cases, however, ocropy deletes some parts (mainly numbers) when it does the segmentation. See below two screenshots of the original binary file and the segmentation output (the .nrm.png files):
Missing some part of '207', original binary: Image after segmentation:
The '3' is completely removed, original binary: After segmentation:
In some cases (e.g. when there are a few 1s, see below), it seems ocropy thinks these are black lines delimiting columns and tries to ignore them. But I don't want it to do this, as I'm already removing in Gimp every black line that could be mistaken for a separator.
Several 1s together might appear like a black line?:
Solution?
Does anyone know if there is any piece of code I can modify to avoid this? I've been looking into the gpageseg code but haven't found anything.
Your Environment