kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

What exactly is `baseline` in @title? #15

Closed kba closed 7 years ago

kba commented 8 years ago

https://github.com/kba/hocr-spec/blob/master/hocr-spec.md#baseline:

baseline pn pn-1 ... p0 - a polynomial describing the baseline of a line of text; the polynomial is in the coordinate system of the line, with the bottom left of the bounding box as the origin

If I understand correctly, this will be a tuple x y for all rectangular areas (with bbox)?

zuphilip commented 8 years ago

I interpret this differently. A polynomial can be written with coefficients p_i:

$$y = p_n x^n + p_{n-1} x^{n-1} + \dots + p_1 x + p_0$$

Therefore, baseline 0 0; would stand for the line y = 0*x + 0 = 0, i.e. a horizontal line. Something like baseline 0.019 -22; would stand for y = 0.019*x - 22, which is a slightly skewed line with an offset of -22.
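To make this reading concrete, here is a minimal Python sketch that pulls the baseline coefficients out of a title string and evaluates the polynomial. The function names and the sample title (including its bbox values) are made up for illustration; only the coefficient order pn ... p0 comes from the spec text quoted above.

```python
def parse_baseline(title):
    """Return the baseline coefficients [pn, ..., p0] from an hOCR title
    string, or None if the title has no baseline property."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "baseline":
            return [float(p) for p in parts[1:]]
    return None


def eval_baseline(coeffs, x):
    """Evaluate pn*x^n + ... + p1*x + p0 using Horner's scheme."""
    y = 0.0
    for p in coeffs:
        y = y * x + p
    return y


# Hypothetical title attribute; the numbers are illustrative only.
title = "bbox 36 92 580 122; baseline 0.019 -22"
coeffs = parse_baseline(title)
print(coeffs)                    # [0.019, -22.0]
print(eval_baseline(coeffs, 0))  # -22.0
```

Because the evaluation loops over however many coefficients are present, the same code would also handle a quadratic or higher-order baseline if a producer ever emitted one.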

kba commented 8 years ago

Good explanation, thanks. But when would there be more than two values for baseline?

zuphilip commented 8 years ago

We could try to run Tesseract on skewed_image

Don't know about more than two values...

zuphilip commented 8 years ago

I can confirm my theory. With the perfectly aligned test picture the first value (i.e. the slope of the line) is zero or close to it. But when rotating this picture by 2° (convert -rotate 2) the first value is around 0.035, and we have arctan(0.035) ~= 2°. See here for the hocr file: test_picture_rotated.hocr.txt
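The arithmetic behind that check can be reproduced in a few lines of Python. This is only a sketch of the slope-to-angle relation; skew_degrees is a hypothetical helper name, not part of any hOCR tool.

```python
import math


def skew_degrees(slope):
    """Angle (in degrees) of a straight baseline with the given slope."""
    return math.degrees(math.atan(slope))


print(round(skew_degrees(0.035), 2))        # 2.0
print(round(math.tan(math.radians(2)), 3))  # 0.035
```

Going the other way, tan(2°) rounds to 0.035, which matches the first baseline value observed for the rotated picture.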

zuphilip commented 8 years ago

Once the text lines have been found, the baselines are fitted more precisely using a quadratic spline. This was another first for an OCR system, and enabled Tesseract to handle pages with curved baselines [5], which are a common artifact in scanning, and not just at book bindings.

http://static.googleusercontent.com/media/research.google.com/de//pubs/archive/33418.pdf

But I don't know if Tesseract is actually (still) working like this...

kba commented 8 years ago

Does that mean, baseline is like the Bezier curves in image editing software? (I Am Not A Mathematician)

Baselines are slightly curved most of the time unless the book spine is removed before scanning, so it's sensible to represent them curved. I wonder if layout engines are able to do that. I could not find any mechanism in ALTO to represent curved baselines.

http://cennser.org/IJCVSP/finalPaper/030101.pdf

kba commented 8 years ago

Related: https://github.com/altoxml/schema/issues/32

zuphilip commented 8 years ago

Well, IMO the baseline would look like this:

[image: baseline-red]

There is even a Wikipedia article for baseline in typography.

In the ideal world this is really a (horizontal) line. But for handwriting or skewed scans of text it looks different. We can try to approximate it with some polynomial (or B-spline, Bezier curve or whatever) or just give the best-fitting straight line.

In the ALTO case I read that they want to indicate a "list of points" (how should they be connected together in the end?) or maybe they mean a list of values (?). Here in hOCR one has to specify the coefficients of the polynomial, which as a function determines the corresponding y-coordinate for each x-coordinate.
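To relate the two representations, the polynomial can be sampled into a list of points. The sketch below is hedged on one assumption that the spec text does not spell out: that the line-local y axis runs in the same direction as the page y axis, so a point is moved to page coordinates by adding the bbox's left and bottom values. The function name, the step parameter and the sample numbers are made up for illustration.

```python
def baseline_points(bbox, coeffs, step=50):
    """Sample the baseline polynomial into (x, y) points in page coordinates.

    bbox is (left, top, right, bottom); coeffs is [pn, ..., p0].
    ASSUMPTION: local (0, 0) is the bottom-left corner of the bbox and the
    local axes point the same way as the page axes.
    """
    left, top, right, bottom = bbox
    width = right - left
    # Sample every `step` pixels and always include the right edge.
    xs = list(range(0, width, step)) + [width]
    points = []
    for x in xs:
        y = 0.0
        for p in coeffs:  # Horner evaluation of the polynomial
            y = y * x + p
        points.append((left + x, bottom + y))
    return points


# Illustrative values only: a horizontal baseline 5 units from the bbox bottom.
print(baseline_points((36, 92, 580, 122), [0.0, -5.0], step=200))
# [(36, 117.0), (236, 117.0), (436, 117.0), (580, 117.0)]
```

Connecting consecutive points with straight segments gives a polyline; a smaller step follows a curved baseline more closely.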

kba commented 7 years ago

https://github.com/tesseract-ocr/tesseract/blob/a75ab450a8cc9a2b69cf05f5c4f7a39bc44cbacc/api/baseapi.cpp#L1344:

 * NOTE: The hOCR spec is unclear on how to specify baseline coefficients for
 * rotated textlines. For this reason, on textlines that are not upright, this
 * method currently only inserts a 'textangle' property to indicate the rotation
 * direction and does not add any baseline information to the hocr string.

The hOCR generation code in tesseract is easy to follow and well-documented btw.

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-interpret-hocr-baseline-output

kba commented 7 years ago

Thanks for the link, that's a really good description. We should add a FAQ section with such information or just extend the baseline section. Graphical information like the image by @StefRe helps a lot.

amitdo commented 7 years ago

The origin: https://groups.google.com/forum/#!topic/tesseract-ocr/azjzEHTIJUM

kba commented 7 years ago

@StefRe provided his sample for the spec, so the common case with a straight baseline is resolved for me.

Perhaps we can expand @zuphilip's sample with a curved baseline in a new issue, with hOCR data that fits the sample image above.

amitdo commented 7 years ago

The Tesseract API page iterator has a method called 'Baseline()' that returns the baseline of a line or a word as two points (x1, y1) and (x2, y2).
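Given two such points, the two hOCR coefficients of a straight baseline follow directly. The Python sketch below shows one way to do the conversion, assuming both points and the bbox are in page coordinates and that the points are moved into the line's coordinate system by subtracting the bbox's left and bottom values. The function name, the rounding to three decimals and the sample numbers are illustrative choices, not something the spec prescribes.

```python
def baseline_from_points(bbox, p1, p2):
    """Return (slope, offset) for the hOCR baseline property, or None.

    bbox is (left, top, right, bottom); p1 and p2 are (x, y) page points.
    ASSUMPTION: the origin of the line's coordinate system is the
    bottom-left corner of the bbox.
    """
    left, top, right, bottom = bbox
    x1, y1 = p1[0] - left, p1[1] - bottom
    x2, y2 = p2[0] - left, p2[1] - bottom
    if x1 == x2:
        # Vertical segment: y cannot be written as a function of x.
        return None
    slope = (y2 - y1) / (x2 - x1)
    offset = y1 - slope * x1
    return round(slope, 3), round(offset, 3)


# Illustrative values only.
print(baseline_from_points((100, 50, 500, 90), (100, 80), (500, 88)))
# (0.02, -10.0)
```

The None return for a vertical segment reflects the problem noted in the Tesseract comment quoted earlier: the polynomial form has no obvious encoding for text lines that are not upright.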

amitdo commented 7 years ago

Tesseract API again. If you want a 'list of points' (n > 2 points) for a text line, you can build it from the baseline points of the line's words.
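As a rough sketch of that idea, the snippet below collects the endpoints of each word's baseline into one point list and, as an optional extra step, fits a least-squares straight line through them. The input format (one (x1, y1, x2, y2) tuple per word) and both function names are assumptions made for illustration; no Tesseract call is involved.

```python
def line_points(word_baselines):
    """Flatten per-word baselines (x1, y1, x2, y2) into a list of points."""
    points = []
    for x1, y1, x2, y2 in word_baselines:
        points.append((x1, y1))
        points.append((x2, y2))
    return points


def fit_line(points):
    """Least-squares fit y = slope*x + offset; None if it is undefined."""
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # all points share the same x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical word baselines, all lying on y = 0.02*x - 10.
words = [(0, -10.0, 100, -8.0), (120, -7.6, 200, -6.0), (220, -5.6, 300, -4.0)]
points = line_points(words)
print(len(points))  # 6
print(fit_line(points))
```

With more than two points the same list could also feed a higher-order fit, which is where more than two baseline coefficients would come from.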