kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Future of hOCR #17

Open kba opened 8 years ago

kba commented 8 years ago

hOCR is easy to implement because it's based on HTML but it can hardly be called a standard while there are living standards for OCR like ALTO.

hOCR is used by Open Source engines like tesseract, ocropy, kraken, cuneiform. Is their output spec-conformant and uniform? Would it not be better to enhance them to support ALTO if they do not already?

I like hOCR's approach for extensibility and microformat-like simplicity but it has not been updated for several years and I think it should not be used for new implementations unless there are very compelling reasons not to use ALTO.

That being said, there is software around that produces hOCR and related tools that expect hOCR (or some dialect of it).

What I think needs to be done in any case:

  1. Reduce the specs to the parts that are in actual use
  2. Restructure it to make it more coherent and provide more examples
  3. Produce a new major version indicating those changes and removals.

That new version should either be developed/refined further (e.g. by standardizing x* properties/classes) or contain a prominent deprecation notice that recommends another format like ALTO.

CC @tmbdev @mittagessen @zdenop @amitdo @zuphilip @cneud @stweil

stweil commented 8 years ago

For Tesseract there exists an open issue (https://github.com/tesseract-ocr/tesseract/issues/419) to add ALTO support.

amitdo commented 8 years ago

Reduce the specs to the parts that are in actual use

There are some features in the Tesseract API which can be presented with hOCR, but currently they aren't. A future release may change that.

wanghaisheng commented 8 years ago

In my recent work we are dealing with thousands of PDF files; some of them are actually images, so we rely on OCR. Has anyone taken a look at https://github.com/coolwanglu/pdf2htmlEX or https://github.com/modesty/pdf2json (and, forgot to mention, my favorite, https://github.com/euske/pdfminer)? How do these tools preserve the style info? We are trying to get a unified format so that, on the one hand, we only need one tool / one set of rules to extract info from them, and on the other hand, we can turn them into all kinds of presentation forms (HTML, PDF, XPS, image, or whatever).

mittagessen commented 8 years ago

kraken currently relies on some features of the spec that are rather bothersome to implement and parse, especially character coordinates, which are encoded using a weird running-delta format. From personal experience, hOCR outputs are not compatible between engines as there are different dialects, most notoriously the misspelled ocr_word tags in older tesseract versions, and hOCR is not really used by any third-party tools as an input format.

In addition, the specification is barely worth the name. Almost no semantics are defined for any tags, there are arbitrary distinctions between ocr/ocrx, some features are tailored to ocropus, and much of the feature set has never been used anywhere.

I have high hopes for ALTO as they seem to be quite receptive for the features I'd need to completely replace hOCR/TEI-OCR in our applications. There is already a rudimentary jinja template for kraken although including character coordinates isn't possible just yet.

amitdo commented 8 years ago

@mittagessen

some features are tailored to ocropus, and lots of the feature set have never been used anywhere.

@zuphilip in #66

AFAIK ocropus itself does not use any logical tags

ocropy barely uses the hOCR features

https://github.com/tmbdev/ocropy/blob/1a906e3ac/ocropus-hocr https://github.com/tmbdev/ocropy/blob/1a906e3ac/ocrolib/hocr.py

What it uses:

kba commented 8 years ago

Kraken's hOCR implementation:

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
        <meta name="ocr-system" content="kraken"/>
        <meta name="ocr-capabilities" content="ocr_page ocr_line ocrx_word"/>
    </head>
    <body>
        <div class="ocr_page" title="bbox 0 0 {{ page.size|join(' ') }}, image {{ page.name }}">
            {% for line in page.lines %}
            <span class="ocr_line" id="line_{{ line.index }}" title="bbox {{ line.bbox|join(' ') }}; cuts {{ line.deltas }}">
            {% for segment in line.recognition %}
                <span class="ocrx_word" id="segment_{{ segment.index }}" title="bbox {{ segment.bbox|join(' ') }}, x_conf {{ segment.confidences|join(' ') }}">{{ segment.text }}</span>
            {% endfor %}
            </span>
            <br/>
            {% endfor %}

        </div>
    </body>
</html>
amitdo commented 8 years ago

hOCR is used by Open Source engines like tesseract, ocropy, kraken, cuneiform.

  • Cuneiform has not been maintained for many years (5-6).
  • ocropy - as said, it currently barely uses the hOCR features.
  • kraken - it seems that @mittagessen doesn't like the hOCR format (or rather, its spec) and wants to move to another OCR format.
    @mittagessen, do you plan to drop the hOCR output in the near future?
  • We are left with Tesseract ... :)
amitdo commented 8 years ago

Kraken's hOCR implementation:

<span class="ocrx_word" id="segment_{{ segment.index }}" title="bbox {{ segment.bbox|join(' ') }}, x_conf {{ segment.confidences|join(' ') }}">{{ segment.text }}</span>

It should be x_wconf instead of x_conf.

id="segment_ - we don't have it in the spec.

kba commented 8 years ago

id="segment_ - we don't have it in the spec.

It's part of HTML/XML, all elements can have one id=. But we could add that they SHOULD have an id or MUST have one, is that what you mean?

amitdo commented 8 years ago

It's part of HTML/XML, all elements can have one id=. But we could add that they SHOULD have an id or MUST have one

Agree.

Also, there are a few examples for using ids in the spec. id="segment_ ..." is not one of them...

mittagessen commented 8 years ago

There was a typo and x_conf(s) is actually correct. They are character/glyph/grapheme confidences as ocropus/kraken doesn't have a notion of words as such. I like to call them segments as I feel uncomfortable going as far as using "words" although the Unicode word segmentation algorithm is used to produce them.

Unfortunately, the format of just having a Unicode string inside spans/divs and associating a list of confidences is utterly useless without any kind of explicit mapping mechanism between particular code points and confidences. People like to renormalize data to one of the different normalization formats and then correspondence can't be established anymore without having access to the original classifier or rather its output alphabet. To be fair, the same issue applies to ALTO right now.

I will keep hOCR support for the foreseeable future, as I can see a use case for having a format serializing at least some useful metadata such as source images and word bounding boxes while being viewable without a separate style sheet in a browser. The template is fixed and maintenance quite low effort mostly because nobody's using hOCR for anything directly in a manner where they'd expect interoperability between implementations. Although I pity the developer who ever has to parse running deltas to extract glyph bounding boxes.

amitdo commented 8 years ago

x_conf(s) is actually correct

Sorry, I misread it.

Tesseract has x_wconf. Its value (0-100) actually gives you the confidence of a glyph in a 'word': the glyph with the lowest confidence in the string.

Unfortunately, the format of just having a Unicode string inside spans/divs and associating a list of confidences is utterly useless without any kind of explicit mapping mechanism between particular code points and confidences. People like to renormalize data to one of the different normalization formats and then correspondence can't be established anymore without having access to the original classifier or rather its output alphabet. To be fair, the same issue applies to ALTO right now.

Although I pity the developer who ever has to parse running deltas to extract glyph bounding boxes.

Since currently you are AFAIK the only OCR engine* that output that information in hOCR, maybe you want to suggest a better syntax that we might use in 'hOCR v2.0' ?

* There is an open PR to add glyph-level info to Tesseract: https://github.com/tesseract-ocr/tesseract/pull/310 I don't know about Cuneiform's hOCR, but I don't care about it since it's a dead project.

mittagessen commented 8 years ago

It would require a substantial reworking of the character encoding process and will definitely break browser viewability, especially for RTL and BiDi text. Basically what you'd have to do is encode each glyph (which may be multiple code points) beneath a separate tag and then associate a confidence value with that, e.g.:


   <span class="ocr_line">
      <span class="ocrx_word">
          <span class="ocrx_glyph" title="x_conf 95">this_is_a_glyph</span>
          <span class="ocrx_glyph" title="x_conf 99">this is another glyph</span>
          ...
      </span>
      ...
   </span>
   ...

It will break display on any HTML based viewer as spans are not without semantics for a number of rendering algorithms (word segmentation and directionality). AFAIK there's no completely semanticless tag without line breaking properties that could be substituted.

There's already an encoding for per-glyph bounding boxes, cuts, but it's quite annoying to implement and parse, so explicit dumping as proposed in the PR is probably preferable. And Nick White is doing work on polytonic Greek, so there are only two or three commonly used glyphs that can't be encoded as a single code point, breaking the whole format.

tmbdev commented 8 years ago

On Sat, Oct 22, 2016 at 4:41 AM, mittagessen notifications@github.com wrote:

There's already an encoding for per-glyph bounding boxes, cuts https://kba.github.io/hocr-spec/1.2/#cuts but it's quite annoying to implement and parse, so an explicit dumping as proposed in the PR is probably preferable

The attribute for bounding boxes is "x_bboxes"; if you want to output bounding boxes, use that. It's in the "engine specific" section because they are, in fact, engine specific (meaning, different engines may have different conventions for how to convert the same segmentation into bounding boxes, although they are usually close).

The "cuts" attribute is for representing cuts. It exists as a compact, pixel-accurate representation of a character segmentation. Cuts are not bounding boxes, and, in fact, are not all that useful unless you have the original page image available.

The "x_bboxes" and "cuts" attributes are both useful and used for very different purposes from each other. If you have the original image available, you can convert cuts into fairly good bounding boxes (otherwise you can't, at least not well).

None of these are "per-glyph" because "glyph" isn't a uniquely defined concept independent of font. As far as hOCR is concerned, you need to output information per codepoint. There is no single correct way of doing that: it depends on the script, the encoding, and the OCR engine.

For bounding boxes (or cuts) on accented Western scripts, my recommendation would be: (1) view the whole accented letter as a single glyph, (2) use normalized unicode with composed characters, (3) if a single glyph corresponds to multiple codepoints, output a bounding box for the first codepoint and output empty bounding boxes for the remaining codepoints.
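That three-step recommendation could be sketched as follows (a minimal illustration; the `(text, bbox)` input format and the `None`-for-empty-box convention are my assumptions, not part of the spec):

```python
import unicodedata

def codepoint_bboxes(glyphs):
    """Sketch: glyphs is a list of (text, bbox) pairs, one per recognized
    glyph. Returns one bbox per codepoint of the NFC-normalized text;
    codepoints beyond the first of a multi-codepoint glyph get an empty
    box, represented here as None."""
    boxes = []
    for text, bbox in glyphs:
        # step (2): normalize to composed form
        text = unicodedata.normalize("NFC", text)
        # step (3): first codepoint carries the box, the rest are empty
        boxes.append(bbox)
        boxes.extend([None] * (len(text) - 1))
    return boxes
```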

Both cuts and x_bboxes are really engine-specific segmentation info that may be useful for some training and display heuristics (cuts should probably be x_cuts). Furthermore, cuts and x_bboxes make sense only for some scripts and some OCR engines. Increasingly, modern OCR engines don't generate or use per-glyph geometric information at all. So, generally speaking, I think it's a bad idea for any consumer of OCR output to rely on the presence of per-glyph geometric information.

As in all other areas, hOCR gives you a way of representing the information you have without loss or conversion, but it doesn't mandate anything. Per-glyph boxes may be useful and may make sense for Western alphabetic scripts using segmenting OCR engines on high quality inputs, but for many other kinds of OCR problems, they simply aren't useful. The intent behind hOCR is that if your processing problem requires information like this, you make that explicit as a profile. So, if you build an application that assumes it gets per-glyph bounding boxes, then your input profile requires that, and it means that your application will only work with OCR engines that produce such output and writing systems for which that makes sense. It's a good idea to think carefully of what the minimum requirements for your application are. For example, many applications actually probably need little more than ocr_line tags with bounding boxes and text in reading order, something almost any OCR engine produces.
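Such a profile check could be as simple as this sketch (the required-class set is illustrative and application-specific; a real check would also verify bbox properties on each element):

```python
import re

# Illustrative requirements: an application that needs little more than
# lines with bounding boxes in reading order, per the example above.
REQUIRED_CLASSES = {"ocr_page", "ocr_line"}

def meets_profile(hocr_html):
    """Return True if every required hOCR class occurs in the document."""
    found = set(re.findall(r'class=[\'"](\w+)[\'"]', hocr_html))
    return REQUIRED_CLASSES <= found
```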

tmbdev commented 8 years ago

On Mon, Oct 10, 2016 at 6:14 AM, mittagessen notifications@github.com wrote:

there are arbitrary distinctions between ocr/ocrx

There are three ocrx tags ("block", "line", "word"); they correspond to page elements that often have the same name in different engines, but are defined and detected differently. For example, a "word" can either be a group of characters surrounded by space, or it can be a linguistic word, or some combination of the two, and engines often don't even document it. The "ocrx" prefix simply alerts you to that. Concepts like "columns", "paragraphs", and (regular) "text lines", on the other hand, have well-defined meanings in terms of typesetting.

tmbdev commented 8 years ago

On Thu, Oct 20, 2016 at 7:46 AM, Amit D. notifications@github.com wrote:

ocropy barely uses the hOCR features

https://github.com/tmbdev/ocropy/blob/1a906e3ac/ocropus-hocr https://github.com/tmbdev/ocropy/blob/1a906e3ac/ocrolib/hocr.py

ocropy only contains basic physical layout analysis; experimental tools for logical layout analysis, and even features such as text/image segmentation, never made it into the open source release, so nothing in ocropy produces those tags. Those tools take physically marked up hOCR and produce logically marked up hOCR as output.

We've also used many more of the hOCR features when converting OCR training databases and OCR output from other engines into a uniform format. That is, many of the hOCR tags are tags that occurred in other formats and needed a translation.

Tom

tmbdev commented 8 years ago

On Mon, Oct 10, 2016 at 6:14 AM, mittagessen notifications@github.com wrote:

kraken currently relies on some rather bothersome to implement/parse features of the spec, especially character coordinates which are encoded using a weird running delta format. From personal experience hOCR outputs are not compatible between engines as there're different dialects,

Well, being able to represent engine-specific and intermediate information is the main point of hOCR. Having a uniform representation of OCR output is the point of ALTO. They are two different specs for two different purposes with two different use cases.

I have high hopes for ALTO as they seem to be quite receptive for the features I'd need to completely replace hOCR/TEI-OCR in our applications.

And if ALTO works as an output format for your application, that's great.

Tom

tmbdev commented 8 years ago

On Fri, Oct 21, 2016 at 7:26 PM, mittagessen notifications@github.com wrote:

x_conf is actually correct. They are character/glyph/grapheme confidences as ocropus/kraken doesn't have a notion of words as such. I like to call them segments as I feel uncomfortable going as far as using "words" although the Unicode word segmentation algorithm is used to produce them.

Unfortunately, the format of just having a Unicode string inside spans/divs and associating a list of confidences is utterly useless without any kind of explicit mapping mechanism between particular code points and confidences. People like to renormalize data to one of the different normalization formats and then correspondence can't be established anymore without having access to the original classifier or rather its output alphabet. To be fair, the same issue applies to ALTO right now.

That's why it is in the engine-specific section. Tags like x_conf are not intended for you to output confidences in new OCR engines, they are intended to be able to take something like existing ABBYY output and encode its existing "character" and "word" confidences inside an hOCR file.

I will keep hOCR support for the foreseeable future, as I can see a use case for having a format serializing at least some useful metadata such as source images and word bounding boxes while being viewable without a separate style sheet in a browser. The template is fixed and maintenance quite low effort mostly because nobody's using hOCR for anything directly in a manner where they'd expect interoperability between implementations.

Correct. hOCR is not intended to be a single archival format like ALTO. One OCR engine might only output ocr_lines, another might only output ocrx_blocks. hOCR is intended to be a set of conventions for encoding OCR-related information in HTML. For example, you can use hOCR to encode traditional ABBYY output format, or the result of OCRopus physical layout analysis, or UNLV manually annotated ground truth data, or many other formats. In addition, hOCR allows you to preserve OCR-related metadata when cutting and pasting between HTML text areas.

The intent was that hOCR guarantees interoperability through specific profiles. De facto, there is a "basic_physical_layout" profile for ocropy and Kraken output. It would also probably be possible to define an "ALTO" profile for hOCR, which maps bidirectionally onto ALTO.

Although I pity the developer who ever has to parse running deltas to extract glyph bounding boxes.

You're free to encode bounding boxes in OCR output directly if you like. Cuts are for pixel-accurate segmentation in the presence of kerning, something bounding boxes can't represent.

For decoding the cuts, you can use this function:

def decode_cuts(s, x=0, ymax=None):
    # Each whitespace-separated path is one cut: the first number is the
    # x offset from the previous cut, the remaining comma-separated values
    # are alternating y/x deltas tracing the cut polyline from the top.
    cuts = []
    for path in s.split():
        turns = [int(p) for p in path.split(",")]
        x += turns[0]
        pos = [x, 0]
        cut = [tuple(pos)]
        for i, d in enumerate(turns[1:]):
            pos[(i + 1) % 2] += d
            cut.append(tuple(pos))
        if ymax is not None:
            pos[1] = ymax
            cut.append(tuple(pos))
        cuts.append(cut)
    return cuts

To convert these to tight bounding boxes, you need the original binary image (it's another 10-20 lines to do that conversion).
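As a rough illustration of that conversion (a simplified sketch, not ocropy's implementation: each cut is collapsed to a single vertical boundary at its maximum x, and the binary image is a plain 2D list rather than an array):

```python
def cuts_to_bboxes(cuts, binary):
    """Tighten the regions between adjacent cuts to the smallest box
    containing ink. binary is indexed [y][x], with 1 for ink pixels;
    cuts are polylines as returned by decode_cuts."""
    height, width = len(binary), len(binary[0])
    # Region boundaries: left image edge, each cut's rightmost x, right edge.
    xs = [0] + [max(x for x, _ in cut) for cut in cuts] + [width]
    bboxes = []
    for x0, x1 in zip(xs, xs[1:]):
        ink = [(x, y) for y in range(height)
               for x in range(x0, x1) if binary[y][x]]
        if ink:  # skip empty inter-cut regions
            bboxes.append((min(p[0] for p in ink), min(p[1] for p in ink),
                           max(p[0] for p in ink), max(p[1] for p in ink)))
    return bboxes
```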

tmbdev commented 8 years ago

On Wed, Sep 14, 2016 at 9:29 AM, Konstantin Baierer <notifications@github.com> wrote:

I like hOCR's approach for extensibility and microformat-like simplicity but it has not been updated for several years and I think it should not be used for new implementations unless there are very compelling reasons not to use ALTO.

ALTO addresses a different problem from hOCR, and I think it is not as good as an initial output format for OCR engines, both due to its design and due to its complexity.

My recommendation is generally to create initial output for OCR engines in engine-specific hOCR format and then transform that into ALTO if you need ALTO, or generic hOCR if you want an HTML representation with embedded OCR information.

Tom

amitdo commented 8 years ago

Thanks Tom!

I hope nothing said here hurts your feelings...

tmbdev commented 8 years ago

Well, these things should be better explained in the spec. hOCR is just a different kind of spec for a different purpose than ALTO.


mittagessen commented 8 years ago

When seen as a format to serialize as much engine specific information as possible, hOCR makes a lot more sense although most people tend to use some kind of TEI profile as it is more flexible for that purpose (and hence infinitely more complex).

For some reason I missed the x_bboxes part of the specification

My point with the x_cuts, xconfs, x* still stands even if you cut it down to a single engine and reencoding existing output. Without access to the particular model it is still impossible to align confidences/bboxes with code points even when you can make sure that nobody "tampered" with the file by renormalizing it to another Unicode normalization. The fundamental reason is that there is no mapping between Unicode code points and recognition units. Formats like AbbyyXML actually allow this alignment by being designed bottom-up (glyph-first) instead of top down like hOCR. I use "glyph" as the lowest level of label an engine may produce.

While per-character bounding boxes are indeed rather useless (and techniques like CTC layers may or may not produce them randomly), quite a few people seem keen on confidences for postprocessing.

BTW, columns and paragraphs have no typesetting definition whatsoever. They are purely semantic, e.g. a paragraph typeset in Europe follows quite different typographical conventions than one typeset in the US, and typographical "columns", e.g. in Persian poetry, sometimes don't affect reading order at all (being read continuously across "columns"). The highest unit of typesetting everybody seems to agree upon may be lines, although I'm fairly sure somebody will prove me wrong given enough time.

wanghaisheng commented 8 years ago

"The intent was that hOCR guarantees interoperability through specific profiles" - where can I get some of those?

tmbdev commented 8 years ago

You define them for your application and document them.

For example, your application might require ocr_line and ocr_column, so you check for their presence and document that you need them.


tmbdev commented 8 years ago

On Tue, Oct 25, 2016 at 5:00 PM, mittagessen notifications@github.com wrote:

My point with the x_cuts, xconfs, x* still stands even if you cut it down to a single engine and reencoding existing output. Without access to the particular model it is still impossible to align confidences/bboxes with code points even when you can make sure that nobody "tampered" with the file by renormalizing it to another Unicode normalization. The fundamental reason is that there is no mapping between Unicode code points and recognition units. Formats like AbbyyXML actually allow this alignment by being designed bottom-up (glyph-first) instead of top down like hOCR. I use "glyph" as the lowest level of label an engine may produce.

As I was saying, the way to represent Abbyy/XML information with x_bboxes is to put the bounding box on the first codepoint corresponding to the glyph and then put empty bounding boxes on the remaining codepoints. Where do you see the problem/limitation?

While per-character bounding boxes are indeed rather useless (and techniques like CTC layers may or may not produce them randomly), quite a few people seem keen on confidences for postprocessing.

Sure, but that is also highly engine specific. Confidences might be derived from raw recognition lattices, or after applying a language model, so you don't know how they relate to glyphs.

The only confidence information that can be interpreted independent of engine is a set of recognition alternatives with associated probabilities. hOCR lets you represent that.
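If I read the spec's "alternatives" mechanism correctly (a span with class="alternatives" containing ins/del elements carrying an nlp property, i.e. negative log probability), generating such markup might look like this sketch (the readings input format is assumed):

```python
def alternatives_span(readings):
    """Sketch: readings is a list of (text, nlp_cost) pairs, best first.
    The best reading becomes the <ins>, the alternatives become <del>s,
    each annotated with its cost via the nlp property."""
    best, rest = readings[0], readings[1:]
    parts = ['<span class="alternatives">',
             '<ins class="alt" title="nlp %.2f">%s</ins>' % (best[1], best[0])]
    for text, cost in rest:
        parts.append('<del class="alt" title="nlp %.2f">%s</del>' % (cost, text))
    parts.append('</span>')
    return "".join(parts)
```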

wollmers commented 7 years ago

@mittagessen @tmbdev

IMHO the lowest-level bboxes for text should represent graphemes (in the wide sense, not Unicode "extended grapheme clusters").

E.g. Fraktur has many ligatures like ch and ck which are not single codepoints in Unicode.

Tesseract can produce a box file with the results of the character segmentation which looks e.g. for the word "stecken" processed with the training data deu_frak like this:

st 707 754 727 790 0
e 731 760 742 781 0
ck 740 760 765 789 0
e 766 760 776 780 0
n 779 760 795 781 0

The hOCR contains

<span class='ocrx_word' id='word_1_337' title='bbox 707 2027 795 2063; x_wconf 81' lang='deu-frak' dir='ltr'>stecken</span>

Now if I want to have bboxes at grapheme level in the hOCR I would propose an element ocrx_segment (or ocrx_grapheme) to be used like this:

<span class='ocrx_word' id='word_1_337' title='bbox 707 2027 795 2063; x_wconf 81' lang='deu-frak'>
  <span class='ocrx_segment' title='bbox 707 754 727 790; x_sconf 81'>st</span>
  <span class='ocrx_segment' title='bbox 731 760 742 781; x_sconf 81'>e</span>
  <span class='ocrx_segment' title='bbox 740 760 765 789; x_sconf 81'>ck</span>
  <span class='ocrx_segment' title='bbox 766 760 776 780; x_sconf 81'>e</span>
  <span class='ocrx_segment' title='bbox 779 760 795 781; x_sconf 81'>n</span>
</span>

This has a clear structure, i.e. a segment/grapheme can have one or more codepoints (base or combining characters).
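For illustration, serializing the proposed markup from Tesseract-style box lines might look like this (a rough sketch: x_sconf is part of the proposal above, not the spec, and the coordinate-system conversion between box files and hOCR image coordinates is deliberately omitted):

```python
def segments_to_hocr(box_lines, wconf):
    """Sketch: turn Tesseract box-file lines ("text x0 y0 x1 y1 page")
    into the proposed ocrx_segment spans. The box values are passed
    through as-is; a real implementation would convert the box file's
    bottom-left-origin coordinates to hOCR's image coordinates."""
    spans = []
    for line in box_lines:
        parts = line.split()
        text, coords = parts[0], parts[1:5]
        spans.append("  <span class='ocrx_segment' title='bbox %s; x_sconf %d'>%s</span>"
                     % (" ".join(coords), wconf, text))
    return "\n".join(spans)
```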