ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

ocropus-gpageseg: Defective line splitting #210

Open wrznr opened 7 years ago

wrznr commented 7 years ago

Expected Behavior

Simple running text should be consistently split into lines.

Current Behavior

I am currently working on data from the Grenzboten project together with @uvius. For some images, line splitting does not work. It is not clear why, because very similar images are split correctly.

Steps to Reproduce (for bugs)

  1. Download test files:

179411_01 nrm 179411_01 bin

  2. Run ocropus-gpageseg on testfile(s).
  3. Inspect results.

Test files have been created with ocropus-nlbin. Tested various command line parameter settings without success.

Your Environment

zuphilip commented 7 years ago

Okay, I looked at the debug output with --debug and it seems that the detected scale is too small (approximately half of the correct size):

[Image: scale-default_lineseeds]

The disconnected (red) components are then creating the different lines.

If you increase that value by hand by setting the --scale parameter:

ocropus-gpageseg grenzboten.bin.png -n --debug --scale 30

then the output looks good:

[Image: scale-30_lineseeds]

(Don't forget to remove all old images from the directory containing the lines.)
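The cleanup step could be scripted along these lines. This is only a sketch: the helper names `line_output_dir` and `remove_old_lines` are made up for illustration, and the directory naming (the image path cut at the first dot of the file name, so `grenzboten.bin.png` maps to `grenzboten/`) is an assumption that should be checked against what ocropus-gpageseg actually writes on your system.

```python
from pathlib import Path


def line_output_dir(image_path):
    """Guess the directory holding the extracted lines for a page image.

    Assumption (not confirmed in this thread): the directory name is the
    image file name cut at its first dot, e.g. book/0001.bin.png -> book/0001.
    """
    path = Path(image_path)
    return path.with_name(path.name.split(".", 1)[0])


def remove_old_lines(image_path, pattern="*.png"):
    """Delete previously extracted line images and return the removed paths."""
    out_dir = line_output_dir(image_path)
    if not out_dir.is_dir():
        return []  # nothing was segmented yet
    removed = sorted(out_dir.glob(pattern))
    for line_image in removed:
        line_image.unlink()
    return removed
```

Calling `remove_old_lines("grenzboten.bin.png")` before re-running the segmentation with a different --scale value would then avoid mixing old and new line images.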

wrznr commented 7 years ago

Alright, thanks. That fixes the issue for the specific image (and many others). But if I set this parameter globally for the whole (pre-segmented) book, new problems arise with smaller (e.g. on-line images). Is there a known bug in the scale detection?

zuphilip commented 7 years ago

> new problems arise with smaller (e.g. on-line images)

I don't know exactly what you mean by "on-line images", but in general, when font sizes vary a lot (header vs. body text vs. footnote text), ocropus has some problems and you might need additional steps.

> Is there a known bug in the scale detection?

None that I am aware of, but in the example you provide, ocropus's guess for the scale parameter is clearly not optimal. My guess is that for your test image the binarization produces characters that are split into several connected components, and this influences the estimation of the scale parameter. I tried another binarization method here, and then the result also seems okay.

wrznr commented 7 years ago

Sorry @zuphilip. This is a typo and should be "one-line images" (i.e., images which cover only a single line). So it's not the varying font size but rather varying clipping sizes from the whole page image which cause the issues.

> I tried another binarization method here, and then the result seems also okay.

This is a great idea. I used ocropus-nlbin, which seemed the most obvious choice. In my experience, the Tesseract line splitting is far superior to ocropus-gpageseg, but this probably boils down to binarization.

Many thanks for your ongoing support!

amitdo commented 7 years ago

https://github.com/tmbdev/ocropy/blob/master/OLD/ocropus-sauvola

zuphilip commented 7 years ago

The scale estimation in ocropus produces this scalemap for your example:

[Image: grenzboten-scalemap]

As far as I understand, the following happens next: for each of these boxes, the algorithm calculates the area and then takes the square root (i.e. the geometric mean of width and height). The median of these numbers (excluding outliers) is then taken as the scale. Maybe in your example there are too many small connected components and/or the font is too narrow...

( The corresponding Jupyter notebook is here: https://gist.github.com/zuphilip/e551ba6b733b5094749799651e4fbd3e )
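To make the description above concrete, here is a simplified pure-Python sketch of that kind of estimate. It is not the ocropus code: it takes the median over components rather than over a pixel map, the function names are illustrative, and the outlier bounds `lo=3` and `hi=100` are assumed values. The toy page at the end shows how splitting each "glyph" into two fragments pulls the estimate down.

```python
from collections import deque
from math import sqrt
from statistics import median


def component_boxes(binary):
    """Return (height, width) of the bounding box of every 8-connected
    foreground component in a 2D list of 0/1 values."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Flood-fill one component and track its extent.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r1 - r0 + 1, c1 - c0 + 1))
    return boxes


def estimate_scale(binary, lo=3, hi=100):
    """Median of sqrt(box area) over all components, ignoring sizes
    outside the (assumed) outlier bounds lo and hi."""
    sizes = [sqrt(h * w) for h, w in component_boxes(binary)]
    kept = [s for s in sizes if lo < s < hi]
    return median(kept) if kept else None


def page_of_blocks(n, height, width, gap=2):
    """Draw n solid blocks side by side as a stand-in for glyphs."""
    page = [[0] * (n * (width + gap)) for _ in range(height)]
    for i in range(n):
        start = i * (width + gap)
        for y in range(height):
            for x in range(start, start + width):
                page[y][x] = 1
    return page


intact = page_of_blocks(5, 16, 16)

# Simulate a binarization that breaks every glyph into two fragments
# by erasing the middle rows.
fragmented = [row[:] for row in intact]
for y in range(4, 12):
    fragmented[y] = [0] * len(fragmented[y])

print(estimate_scale(intact))      # 16.0
print(estimate_scale(fragmented))  # 8.0
```

In this toy case the fragmented page yields half the scale of the intact one, which is the same kind of underestimate seen in the debug output above.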

Sauvola is one possibility, and I think ocropus-nlbin has more parameters to try out. Moreover, it should be possible to mix some of the steps of Tesseract with some of the steps of Ocropus.

@wrznr Thank you for asking interesting questions!

wrznr commented 7 years ago

Indeed, using e.g. scantailor for binarization results in an almost error-free line splitting! Only small one-line segments like page numbers and signature marks (which is probably to be expected) are not correctly processed. Significant step forward!

While this is great for me, is it a problem for ocropus (i.e., does it point to problems in the combination of nlbin and gpageseg)?