kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Logical Tags/classes #66

Open zuphilip opened 7 years ago

zuphilip commented 7 years ago

I don't understand how the logical tags in hOCR should be used. Moreover, I see potential conflicts with other nested tags from the layout. AFAIK ocropus itself does not use any logical tags and tesseract only supports ocr_par. For most hocr logical classes there are equivalent html tags and therefore I don't see any advantage to add special logical hocr classes there.

Some more specific questions about the logical hocr classes:

What do you think?

kba commented 7 years ago

As @mittagessen said, the semantics of these tags are pure guesswork, since there is little in the spec beyond "These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others."

For most hocr logical classes there are equivalent html tags and therefore I don't see any advantage to add special logical hocr classes there.

If you don't need the logical classes, you can just use the typesetting classes. hOCR obviously comes from a time before HTML5 (newer tags, data- attributes, microdata etc.), it's more like microformats. You can then use e.g. nested <section|article|address|div> or some other tag/mechanism for organising logical structure. It would only be relevant if any tools expected these classes to have meaning but since no one produces them, no one consumes them.

Is the ocr_document the same as the html document

No, I wouldn't introduce that restriction, plus it would be redundant. I think it's more of an optional indicator where the OCR document begins vs. where the pages are.

can there be multiple ocr_documents in the same html document?

Yes, since we have no semantics, I would not restrict unless there's a good reason.

Should ocr_author be used to indicate some "byline" area or should there be some metadata about the authors given there?

What is ocr_display?

I think @tmbdev was strongly inspired by LaTeX, both in terms of hierarchical structure and typesetting. In LaTeX, display math mode means block-level formulas as opposed to inline ones. Cf. http://kba.github.io/hocr-spec/1.2/#ocr_math
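For readers less familiar with LaTeX, the inline/display distinction looks roughly like this (a generic illustration, not taken from the spec):

```latex
Inline math stays inside the running text: $a^2 + b^2 = c^2$.

Display math is set off as its own block:
\[
  a^2 + b^2 = c^2
\]
```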

"The standard HTML tags given in brackets specify the preferred HTML tags to use with those logical structuring elements." How exactly are these elements used? Are they just marking the beginning of something new or should they be nested into each other?

Not sure if I understand your question. I think the tags in square brackets represent the HTML tag name you should use for an element with this class.

ocr_part [<h1>] ->

<h1 class='ocr_part'> ... </h1>
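As a rough illustration of that reading, the bracketed tag can be treated as a lookup from class name to preferred tag. The sketch below is Python; the `PREFERRED_TAG` table and the `render` helper are hypothetical, and only the `ocr_part` entry comes from this thread (the `ocr_par` entry assumes the `<p class='ocr_par'>` form that tesseract emits).

```python
from html import escape

# Hypothetical lookup table: logical hOCR class -> preferred HTML tag.
# Only ocr_part -> h1 is taken from the thread; ocr_par -> p is an assumption.
PREFERRED_TAG = {
    "ocr_part": "h1",
    "ocr_par": "p",
}

def render(cls: str, text: str, fallback: str = "div") -> str:
    """Wrap text in the preferred tag for cls, or in fallback if none is known."""
    tag = PREFERRED_TAG.get(cls, fallback)
    return f"<{tag} class='{escape(cls)}'>{escape(text)}</{tag}>"

print(render("ocr_part", "Part I"))
print(render("ocr_abstract", "Summary"))
```

Classes without a known preferred tag fall back to a neutral `div` here; that fallback is a choice made for the example, not something the spec prescribes.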

EDIT: Now I understand. That paragraph is in the wrong section; I'll fix it.

Is ocr_linear a special case of ocr_par or why is it inside this subsection?

No, that is just an error. We should turn those into something more compact, as was done for the metadata and HTML markup sections.

amitdo commented 7 years ago

There are too many things in the spec which are not very clear.

I see two possible solutions:

  1. Try to understand the original author's intent. This is often guesswork.
  2. Ask the original author (Tom) to clarify a few things for us.

amitdo commented 7 years ago

I didn't see @kba's comment before sending mine (he sent his while I was editing mine).

It's funny we both used the word 'guesswork'.

kba commented 7 years ago

  2. Ask the original author (Tom) to clarify a few things for us.

Agreed, but I'd say the semantics of the logical structuring elements are low priority and should probably just be handled in a subsection. HTML has good mechanisms for expressing the logical structure of a document.

I'm not saying we shouldn't ask Tom, but I think it's more important to specify the semantics and mechanics of features that might actually be used but aren't (?), such as reading order (ocr_linear), grouping and so on.

It's funny we both used the word 'guesswork'.

It is kinda telling :)

zuphilip commented 7 years ago

Okay, it looks like we agree that this section involves a lot of "guesswork", but the logical structure elements are not used much anyhow, so this is only low priority. I added the label "postpone" here for now.

amitdo commented 7 years ago

My comment about the 'guesswork' was general. #28 and #19 are just two examples of that.

tmbdev commented 7 years ago

On Thu, Oct 20, 2016 at 4:40 AM, Konstantin Baierer < notifications@github.com> wrote:

As @mittagessen https://github.com/mittagessen said https://github.com/kba/hocr-spec/issues/17#issuecomment-252614669, the semantics of these tags are pure guesswork, since there is little in the spec beyond "These logical tags have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others."

I'm not sure what additional semantics you are looking for. The logical markup in hOCR is basically the same as that found in LaTeX and is intended to have the same semantics.

For most hocr logical classes there are equivalent html tags and therefore I don't see any advantage to add special logical hocr classes there.

The reason hOCR defines how to encode logical markup as either HTML tags or as hOCR classes is that there are different use cases that require one or the other. Keep in mind that hOCR isn't just an encoding of OCR output in HTML; it is actual HTML that can be displayed in a browser. When you display it in a browser, you can copy and paste it, and the OCR metadata gets copied along with the text itself (this is not true for formats like ALTO). Sometimes, in such use cases, it is OK to use HTML tags directly; in other cases, you want to keep the logical layout information around but not have it affect the HTML presentation.
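A minimal Python sketch of that point, assuming a consumer that looks only at the class attribute: the same logical class is found whether it sits on the preferred tag or on a presentation-neutral one. The `LogicalCollector` class, its `LOGICAL` set and the sample markup are made up for illustration; `html.parser` is from the standard library.

```python
from html.parser import HTMLParser

class LogicalCollector(HTMLParser):
    """Collect (tag, class) pairs for elements carrying a logical ocr_ class."""

    # Hypothetical subset of logical classes, chosen for the example.
    LOGICAL = {"ocr_document", "ocr_part", "ocr_par"}

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # An element may carry several classes; check each one.
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.LOGICAL:
                self.found.append((tag, cls))

sample = (
    "<div class='ocr_document'>"
    "<h1 class='ocr_part'>Part I</h1>"       # logical class on the preferred tag
    "<span class='ocr_part'>Part II</span>"  # same class, presentation-neutral tag
    "</div>"
)

collector = LogicalCollector()
collector.feed(sample)
print(collector.found)
```

Both `ocr_part` elements are reported, so the logical information survives even when the producer chooses not to use the heading tag.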

If you don't need the logical classes, you can just use the typesetting classes. hOCR obviously comes from a time before HTML5 (newer tags, data- attributes, microdata etc.), it's more like microformats http://microformats.org/.

hOCR was developed around the time HTML5 came out, but it seemed important at the time to still support older versions of HTML. I'm not sure that is still true. It may be worth revisiting that question.

Is the ocr_document the same as the html document

can there be multiple ocr_documents in the same html document?

Should ocr_author be used to indicate some "byline" area or should there be some metadata about the authors given there?

What is ocr_display?

I think @tmbdev https://github.com/tmbdev was strongly inspired by LaTeX, both in terms of hierarchical structure and typesetting. In LaTeX, display math mode means block-level formulas as opposed to inline ones. Cf. http://kba.github.io/hocr-spec/1.2/#ocr_math

Correct. Basically, for any of these tags, the intent is to follow what LaTeX does. For example, as in LaTeX, ocr_author does not encode document metadata, it merely indicates that an area of the page contains author information, in no particular format (it might even be an image). For actual, machine readable document metadata, hOCR uses Dublin Core, but that is unrelated to the logical layout tags.

Is ocr_linear a special case of ocr_par or why is it inside this subsection?

No, that is just an error. We should turn those into something more compact, as was done for the metadata and HTML markup sections.

The nesting hierarchy is indicated in the figure below; probably the list above should be merged with the hierarchy into a single figure.