kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Drop support for polygons? #16

Open kba opened 8 years ago

kba commented 8 years ago

Polygons are obviously more flexible than rectangles but make the specs more complicated, e.g. #15

Are there any engines with ocrp_poly capability? Are there any examples in the wild?

zuphilip commented 8 years ago

Polygons =/= Polynomials, or is there any other connection to the issue?

Second question: I don't know, but this is a good question...

kba commented 8 years ago

> Polygons =/= Polynomials, or is there any other connection to the issue?

Why polynomials :confused: :question:

zuphilip commented 8 years ago

https://www.google.de/search?q=ocrp_poly

kba commented 8 years ago

Not all engines indicate their capabilities, e.g. ocrp_lang. Searching for hocr and poly yields nothing either. I'm pretty sure that it is not used; I'm mostly looking for counter-examples.
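For engines that do declare capabilities, a check for ocrp_poly could look roughly like the sketch below. It assumes the capability list has already been extracted from the document's `ocr-capabilities` meta tag as a whitespace-separated string; `HasCapability` is a hypothetical helper name, not part of any hOCR tool.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `capabilities` (assumed to be the
// whitespace-separated content of an hOCR "ocr-capabilities" meta tag)
// lists `wanted` as a whole token.
bool HasCapability(const std::string& capabilities, const std::string& wanted) {
  assert(!wanted.empty() && "capability name must not be empty");
  std::istringstream tokens(capabilities);
  std::string token;
  while (tokens >> token) {
    // Compare whole tokens so that e.g. "ocrp_polyx" does not match.
    if (token == wanted) return true;
  }
  return false;
}
```

As noted above, a missing ocrp_poly token does not prove that a file has no polygons, since engines may simply not declare what they emit.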

amitdo commented 8 years ago

Tesseract API: https://github.com/tesseract-ocr/tesseract/blob/a75ab450a8cc/ccmain/pageiterator.h#L228

kba commented 8 years ago

I'm not very familiar with the Tesseract code, but from reading the baseapi/renderer code I only see straightforward rectangles within rectangles, slanted or shifted but still bounding boxes.

I also searched for polygon-related code and saw it used in page segmentation but not serialized. Are there any plans to support bounding polygons in tesseract in the future?

amitdo commented 8 years ago

From the above link:

  /**
   * Returns the polygon outline of the current block. The returned Pta must
   * be ptaDestroy-ed after use. Note that the returned Pta lists the vertices
   * of the polygon, and the last edge is the line segment between the last
   * point and the first point. NULL will be returned if the iterator is
   * at the end of the document or layout analysis was not used.
   */
  Pta* BlockPolygon() const;

Although pageiterator.h is not placed under the api directory, it is part of the Tesseract API.

This method is an alternative to getting a bounding box for a 'block'. None of the renderers (hOCR, PDF etc.) uses this method currently. Pta is defined in Leptonica.
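To illustrate how a block polygon relates to its bounding box, here is a minimal, self-contained sketch. It does not call Tesseract or Leptonica: `Point` is a hypothetical stand-in for one vertex of the returned Pta, and `HocrTitle` is an invented helper. It also assumes the hOCR `poly` property takes a flat list of x y pairs in the same style as `bbox`.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one polygon vertex (not the Leptonica Pta type).
struct Point { int x; int y; };

struct Box { int x0, y0, x1, y1; };

// Smallest axis-aligned rectangle that contains every vertex.
Box BoundingBox(const std::vector<Point>& polygon) {
  assert(!polygon.empty());
  Box box{polygon[0].x, polygon[0].y, polygon[0].x, polygon[0].y};
  for (const Point& p : polygon) {
    box.x0 = std::min(box.x0, p.x);
    box.y0 = std::min(box.y0, p.y);
    box.x1 = std::max(box.x1, p.x);
    box.y1 = std::max(box.y1, p.y);
  }
  return box;
}

// Builds a title value of the form "bbox x0 y0 x1 y1; poly x0 y0 x1 y1 ...".
// The closing edge (last vertex back to first) is implicit, as in the
// BlockPolygon() comment quoted above.
std::string HocrTitle(const std::vector<Point>& polygon) {
  const Box b = BoundingBox(polygon);
  std::ostringstream out;
  out << "bbox " << b.x0 << ' ' << b.y0 << ' ' << b.x1 << ' ' << b.y1
      << "; poly";
  for (const Point& p : polygon) out << ' ' << p.x << ' ' << p.y;
  return out.str();
}
```

For an L-shaped block, the bounding box also covers the empty corner, which is exactly the information a polygon would preserve and a rectangle loses.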

amitdo commented 8 years ago

About Pta:

https://github.com/DanBloomberg/leptonica/blob/1408893977a3/src/pix.h#L500

https://github.com/DanBloomberg/leptonica/blob/1408893977a3/src/ptabasic.c#L32