Open 1a1a11a opened 8 years ago
Did you activate the detection of black column separators using the -b option? You also have to specify the number of black separators with the --maxseps option. The default is 2. However, if no continuous black line is detected, the specified maximum number of black separators must be higher than the actual number.
Thank you for your reply! I used both the -b and --maxseps options, but I don't quite understand what you mean by "the specified maximum number of black separators must be higher than the actual number". I have three columns, each separated by a black vertical line, but the vertical line is occasionally partially disconnected (these are historical documents, ~100 years old, so sometimes the black line is not that clear). I can understand that the program might merge two columns if the separator is not clear, but sometimes it just treats the black separator as the characters 'l', 'i', '1', and I don't know why. Can you help? @jze Thank you!
I am not sure whether I stated it clearly. If not, just let me know; maybe I can find some examples.
Have you tried the debug mode -d? With debug enabled, ocropus-gpageseg writes an image _colseps.png in which the detected column separators are highlighted. I have not found the parameter that controls how large the gaps in a vertical black line may be for it still to be treated as a continuous line. In an experiment, a hole of 5 pixels cut the line in two, while a 4-pixel hole was ignored (scale was 19.079).
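One hedged guess at where the 4-versus-5 pixel limit could come from: the compute_separators_morph function quoted later in this thread starts with a dilation whose vertical size is d0 = int(max(5, scale/4)), which evaluates to 5 at scale 19.079. Assuming that morph.r_dilation behaves like a plain rectangular maximum filter (an assumption, not verified against the ocrolib source here), a 5-pixel window bridges holes of up to 4 pixels and leaves a 5-pixel hole open. The NumPy-only sketch below imitates just that single step on one pixel column; dilate_column and gap_survives are made-up helper names, not ocropy functions.

```python
import numpy as np

def dilate_column(column, size):
    """Vertical dilation of a 1-D pixel column: max over a centered window."""
    half = size // 2
    # Repeat the edge pixels so the output has the same length as the input.
    padded = np.pad(column, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, size)
    return windows.max(axis=1)[:len(column)]

def gap_survives(gap, scale=19.079, length=60):
    """Return True if a hole of `gap` pixels is still open after the dilation."""
    d0 = int(max(5, scale / 4))               # formula taken from compute_separators_morph
    column = np.ones(length, dtype=np.uint8)  # 1 = black separator pixel
    start = length // 2
    column[start:start + gap] = 0             # cut a hole into the line
    return bool((dilate_column(column, d0) == 0).any())

for gap in (3, 4, 5, 6):
    print(gap, "hole survives" if gap_survives(gap) else "hole bridged")
```

If this guess is right, the tolerated hole size would follow d0, so raising d0 in a local copy of the script might be the knob to experiment with. The later opening and erosion steps are not modelled here, so treat this as an illustration rather than a confirmed diagnosis.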
I haven't tried debug mode yet; let me try it, it sounds like a good suggestion. In your experiment, do you know how to control the number of pixels (it sounds like a threshold) at which the line is cut?
Hello, I am facing a similar problem with historic documents.
The black lines separating columns are sometimes disconnected. Is there a way of specifying a threshold saying, in essence, ignore holes in black lines smaller than X pixels?
I'm attaching two images, showing the binarized text and the columns detected in debug mode with ./ocropus-gpageseg --maxcolseps 0 -b --maxseps 10 --sepwiden 50 -d -n 0001.bin.png.
What I would like to achieve in this image is two column separators: one separating the two columns at the top, and one separating the columns at the bottom.
Any ideas how this could be done?
@sepastian You can try to modify the ocropus-gpageseg algorithm slightly (the parameters may not give you many options). I played around a little and modified the algorithm for this case to
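def compute_separators_morph(binary,scale):
    """Finds vertical black lines corresponding to column separators."""
    # Relies on args, morph, sl and DSAVE as already defined in ocropus-gpageseg.
    d0 = int(max(5,scale/4))
    d1 = int(max(5,scale))+args.sepwiden
    # thicken the black pixels so that small gaps and thin lines get merged
    thick = morph.r_dilation(binary,(d0,d1))
    DSAVE("B-thick",thick)
    # keep only structures that are at least 5*scale pixels tall
    vert = morph.rb_opening(thick,(5*scale,1))
    DSAVE("B-vert1",vert)
    # shrink the surviving regions again
    vert = morph.r_erosion(vert,(d0//2,args.sepwiden))
    # keep up to 2*maxseps regions at least 3 pixels wide,
    # then up to maxseps regions at least 5*scale pixels tall
    vert = morph.select_regions(vert,sl.dim1,min=3,nbest=2*args.maxseps)
    vert = morph.select_regions(vert,sl.dim0,min=5*scale,nbest=args.maxseps)
    return vert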
Then calling ./ocropus-gpageseg tests/twocolumns.bin.png --debug -n -b --maxcolseps 0 --maxseps 10 --usegauss leads to the _lineseeds image linked further down in this thread.
This seems better but not yet perfect. Maybe you also have to play with the binarization process. However, sometimes the black separator line is simply not printed there, which might not be easy to fix afterwards (if at all).
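To get a feeling for the sizes involved, the arithmetic in the modified function can be worked through for one concrete setting. The sketch below only re-evaluates the formulas from the code above; separator_sizes is a hypothetical helper name, and the inputs (scale 19.079 from the earlier experiment, --sepwiden 50 from the command posted above) come from two different images and are combined purely for illustration.

```python
def separator_sizes(scale, sepwiden):
    """Re-evaluate the size formulas used in the modified compute_separators_morph."""
    d0 = int(max(5, scale / 4))         # vertical size of the dilation window
    d1 = int(max(5, scale)) + sepwiden  # horizontal size of the dilation window
    return {
        "dilation_rows": d0,
        "dilation_cols": d1,
        "opening_height": 5 * scale,     # height passed to rb_opening
        "min_region_width": 3,           # min= for the sl.dim1 selection
        "min_region_height": 5 * scale,  # min= for the sl.dim0 selection
    }

sizes = separator_sizes(scale=19.079, sepwiden=50)
for name, value in sizes.items():
    print(f"{name}: {value:.1f}")
```

With these inputs a separator fragment has to be roughly 95 pixels tall to be kept, so a line that is broken into shorter pieces would still be dropped unless the dilation bridges the holes first.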
Thank you, I will try this on Monday!
On Feb 15, 2017 23:32, "Philipp Zumstein" notifications@github.com wrote:
[image: _lineseeds] https://cloud.githubusercontent.com/assets/5199995/22998152/cac2bae2-f3d5-11e6-9b82-710166158074.png
I have pages with three columns separated by black column separators, but sometimes during processing the black lines are recognized as characters. Does anyone have an idea how to solve that? Besides, sometimes two lines are recognized as one line; is there any solution for that?