ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Black column separator #89

Open 1a1a11a opened 8 years ago

1a1a11a commented 8 years ago

I have pages with three columns separated by black column separators, but sometimes the black lines are recognized as characters. Does anyone have an idea how to solve that? Also, sometimes two lines are recognized as one line. Is there a solution for that?

jze commented 8 years ago

Did you activate the detection of black column separators using the -b option? You also have to specify the number of black separators with the --maxseps option. The default is 2. However, if no continuous black line is detected, the specified maximum number of black separators must be higher than the actual number.

1a1a11a commented 8 years ago

Thank you for your reply! I used both the -b and --maxseps options, but I don't quite understand what you mean by "the specified maximum number of black separators must be higher than the actual number". I have three columns, each separated by a black vertical line, but the vertical line is occasionally partially disconnected (these are historical documents, about 100 years old, so sometimes the black line is not that clear). I can understand that the program might connect two columns if the separator is not clear, but sometimes it just treats the black separator as the characters 'l', 'i', or '1', and I don't know why. Can you help? @jze Thank you!

1a1a11a commented 8 years ago

I am not sure whether I stated it clearly. If not, just let me know; maybe I can find some examples.

jze commented 8 years ago

Have you tried the debug mode -d? With debug enabled, ocropus-gpageseg writes an image _colseps.png in which the detected column separators are highlighted. I have not found the parameter that controls how large a gap in a vertical black line may be for the line still to be treated as continuous. In an experiment, a hole of 5 pixels cut the line in two, while a 4 pixel hole was ignored (scale was 19.079).
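A possible explanation for the 4 versus 5 pixel result, offered as a sketch rather than a confirmed reading of the code: the compute_separators_morph function quoted later in this thread dilates the binary image with a vertical size of d0 = int(max(5, scale/4)), which is 5 at scale 19.079. Assuming that dilation acts like a sliding-window maximum, a gap closes only if it is shorter than d0. The helper names below (dilate_1d, gap_survives) are hypothetical and model a single pixel column in pure Python, not ocropy's own morph module.

```python
def dilate_1d(column, size):
    """Sliding-window maximum over a 1-D list of 0/1 pixels."""
    before = size // 2
    after = size - before
    n = len(column)
    out = []
    for i in range(n):
        lo = max(0, i - before)
        hi = min(n, i + after)
        out.append(max(column[lo:hi]))
    return out


def gap_survives(gap, scale):
    """Return True if a gap of `gap` pixels is still open after dilation."""
    # Same formula as d0 in the function quoted later in the thread.
    d0 = int(max(5, scale / 4))
    # A vertical line of black pixels (1) with a hole (0) in the middle.
    column = [1] * 20 + [0] * gap + [1] * 20
    return 0 in dilate_1d(column, d0)


print(gap_survives(4, 19.079))  # False: the 4 pixel hole is closed
print(gap_survives(5, 19.079))  # True: the 5 pixel hole still cuts the line
```

If this reading is right, the gap tolerance is not a separate parameter but follows from the scale (and the hard-coded minimum of 5), so changing it would mean editing d0 in the function itself.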

1a1a11a commented 8 years ago

I haven't tried debug mode yet; let me try it, it sounds like a good suggestion. In your experiment, do you know how to control the number of pixels (it sounds like a threshold) for cutting?

sepastian commented 7 years ago

Hello, I am facing a similar problem with historic documents.

The black lines separating columns are sometimes disconnected. Is there a way of specifying a threshold saying, in essence, ignore holes in black lines smaller than X pixels?

I'm attaching two images, showing the binarized text and the columns detected in debug mode with ./ocropus-gpageseg --maxcolseps 0 -b --maxseps 10 --sepwiden 50 -d -n 0001.bin.png.

[image: 0001 bin]

[image: _colseps]

What I would like to achieve in this image is two column separators: one separating the two columns at the top, and one separating the columns at the bottom.

Any ideas how this could be done?

zuphilip commented 7 years ago

@sepastian You can try to modify the ocropus-gpageseg algorithm slightly (the parameters may not give you many options). I played around a little and modified the algorithm for this case to

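def compute_separators_morph(binary,scale):
    """Finds vertical black lines corresponding to column separators."""
    # Dilation size: d0 rows (vertical), d1 columns (horizontal, widened by --sepwiden).
    d0 = int(max(5,scale/4))
    d1 = int(max(5,scale))+args.sepwiden
    # Thicken black pixels so that small gaps in a separator line close up.
    thick = morph.r_dilation(binary,(d0,d1))
    DSAVE("B-thick",thick)
    # Keep only structures that are roughly 5*scale pixels tall or more.
    vert = morph.rb_opening(thick,(5*scale,1))
    DSAVE("B-vert1",vert)
    # Shrink the surviving regions back after the dilation above.
    vert = morph.r_erosion(vert,(d0//2,args.sepwiden))
    # Filter by width (dim1), then by height (dim0), limited by --maxseps.
    vert = morph.select_regions(vert,sl.dim1,min=3,nbest=2*args.maxseps)
    vert = morph.select_regions(vert,sl.dim0,min=5*scale,nbest=args.maxseps)
    return vert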

Then calling ./ocropus-gpageseg tests/twocolumns.bin.png --debug -n -b --maxcolseps 0 --maxseps 10 --usegauss leads to

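[image: _lineseeds] https://cloud.githubusercontent.com/assets/5199995/22998152/cac2bae2-f3d5-11e6-9b82-710166158074.png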

This seems better but not yet perfect. Maybe you also have to play with the binarization process. However, sometimes the black line separator is simply not printed there, which might not be easy to fix afterwards (if at all).
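To get a feel for the 5*scale values in the modified function, here is a minimal sketch of how a binary opening with a tall, one-pixel-wide element filters a single pixel column. It assumes that morph.rb_opening behaves like a standard binary opening and that the float 5*scale is rounded down to an integer; both are assumptions, and the helper names (sliding, opening_1d, survives_opening) are hypothetical, written in pure Python rather than taken from ocropy.

```python
def sliding(column, size, reduce_fn, pad):
    """Apply reduce_fn over a window of `size` pixels around each position."""
    before = size // 2
    after = size - before
    padded = [pad] * before + list(column) + [pad] * after
    return [reduce_fn(padded[i:i + size]) for i in range(len(column))]


def opening_1d(column, size):
    """Binary opening: erosion, then dilation with the mirrored window."""
    eroded = sliding(column, size, min, pad=0)
    # Reversing before and after the dilation mirrors the window, which
    # keeps the result aligned with the input for even window sizes.
    dilated = sliding(eroded[::-1], size, max, pad=0)
    return dilated[::-1]


def survives_opening(run_length, scale):
    """Return True if a vertical run of black pixels survives the opening."""
    height = int(5 * scale)  # assumption: the float 5*scale is rounded down
    column = [0] * 10 + [1] * run_length + [0] * 10
    return 1 in opening_1d(column, height)


print(survives_opening(95, 19.079))  # True: 95 pixels reach int(5 * 19.079)
print(survives_opening(94, 19.079))  # False: one pixel too short
```

Under these assumptions, a separator fragment has to be at least about 5*scale pixels tall after the dilation step to be kept, so lowering that factor admits shorter fragments at the risk of picking up tall characters or other vertical strokes.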

sepastian commented 7 years ago

Thank you, I will try this on Monday!
