kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Specify that class must be a single value #22

Open kba opened 8 years ago

kba commented 8 years ago

It is not stated explicitly, but it seems to be the consensus among implementations that the special classes like ocr_page, ocrx_word etc. must be the one and only value of the class= attribute of an HTML element.

zuphilip commented 8 years ago

Does the statement https://github.com/tmbdev/hocr-tools/blob/master/hocr-check#L142:

check that only the right attributes are present on the right elements

help, or is it still unclear what this means?

kba commented 8 years ago

Well, of course, but what are the right attributes and what are the right elements :wink:

What I mean is that

<div class="ocr_line pull-right">...</div>

is valid HTML, and in this case it makes sense to indicate that the line is right-aligned (and to provide CSS that assigns text-align: right to all .pull-right elements). It would also be valid hOCR as far as I can see. But all the tools I've seen do not check that class= contains ocr_line; they check that class= is equal to ocr_line.
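To make the difference concrete, here is a minimal Python sketch of the two checks. The function names `has_class_exact` and `has_class_token` are made up for illustration and are not taken from hocr-tools or any other implementation.

```python
def has_class_exact(class_attr, name):
    """Strict check: the whole attribute value must equal `name`."""
    return class_attr == name


def has_class_token(class_attr, name):
    """Token check: `name` must be one of the whitespace-separated classes."""
    return name in class_attr.split()


attr = "ocr_line pull-right"
strict_result = has_class_exact(attr, "ocr_line")  # False: the value has two classes
token_result = has_class_token(attr, "ocr_line")   # True: ocr_line is one of them
```

With the strict check, the element above would not be recognized as an ocr_line at all, which is the behaviour the proposed clarification would make official.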

zuphilip commented 8 years ago

Yeah, understand the differences (this appears commonly in the zotero/translators as well).

I kind of support this clarification, and I guess we can also take the parsing script from the spec as further evidence that it was meant to be this way from the beginning.

However, releasing a new version for this could make sense.

kba commented 8 years ago

The code examples should be replaced; they're using an obsolete API. Even in hocr-tools, I see at least three ways of parsing bbox, none of them exactly like this. See also #13
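For illustration only, one possible way to pull a bbox out of a title attribute with nothing but the standard library might look like the sketch below. It assumes the title holds semicolon-separated properties such as `bbox x0 y0 x1 y1` with non-negative integer coordinates. The name `parse_bbox` is hypothetical, and this is not one of the three approaches found in hocr-tools.

```python
import re

# Assumption: a bbox property is the word "bbox" followed by four
# non-negative integers separated by whitespace.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title string, or None if no bbox is found."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


box = parse_bbox("bbox 10 20 300 40; x_wconf 93")
```

A title without a bbox property yields `None` rather than raising, so callers can decide how strict to be.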

As for a new version: sure, not for every typo or reordering, but for all additions, changes and removals.