Open kba opened 8 years ago
Does the statement at https://github.com/tmbdev/hocr-tools/blob/master/hocr-check#L142, "check that only the right attributes are present on the right elements", help, or is it still not clear what this means?
Well, of course, but what are the right attributes and what are the right elements :wink:
What I mean is that, in practice, you cannot have `<div class="ocr_line pull-right">...</div>`. It is valid HTML, and in this case it makes sense to indicate that the line is right-aligned (and to provide CSS that assigns `text-align: right` to all `.pull-right` elements). It would also be valid hOCR as far as I can see. But all the tools I've seen do not check that `class=` contains `ocr_line`; they check that `class=` is equal to `ocr_line`.
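To make the distinction concrete, here is a minimal Python sketch of the two kinds of check. It is only an illustration, not code taken from hocr-tools or any other implementation; the function names `has_class_exact` and `has_class_token` are made up for this example.

```python
def has_class_exact(class_attr, name):
    """Strict check: the whole class attribute must be exactly `name`."""
    return class_attr == name


def has_class_token(class_attr, name):
    """Lenient check: `name` must be one of the whitespace-separated tokens."""
    return name in class_attr.split()


attr = "ocr_line pull-right"
print(has_class_exact(attr, "ocr_line"))  # False
print(has_class_token(attr, "ocr_line"))  # True
```

With the strict check, adding any extra class such as `pull-right` makes the element invisible to the tool; the token-based check still finds it.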
Yeah, I understand the difference (this comes up commonly in zotero/translators as well).
I kind of support this clarification, and I guess we can also take the parsing script from the spec as further evidence that it was meant to be this way from the beginning.
However, releasing a new version for this could make sense.
The code examples should be replaced; they're using an obsolete API. Even in `hocr-tools`, I see at least three ways of parsing `bbox`, none of them exactly like this. See also #13
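For reference, one possible way to parse `bbox` with only the Python standard library is sketched below. This is not any of the variants found in hocr-tools and not the script from the spec; `BBOX_RE` and `parse_bbox` are hypothetical names, and the sketch assumes the usual `title` layout of semicolon-separated properties with `bbox` followed by four integers.

```python
import re

# Matches "bbox x0 y0 x1 y1" as its own property. The leading \b keeps it
# from matching inside other property names such as "x_bboxes".
BBOX_RE = re.compile(r"\bbbox\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title attribute, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


print(parse_bbox("bbox 10 20 300 40; x_wconf 93"))  # (10, 20, 300, 40)
```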
As for a new version: sure, not for every typo or reordering, but for all additions, changes and removals.
It is not stated explicitly, but there seems to be consensus among implementations that the special classes like `ocr_page`, `ocrx_word` etc. must be the one and only `class=` value of an HTML element.