Open kba opened 7 years ago
I think the `data-ocr-*` attributes would be a good way to continue. But is there any reason to change the `class` as well? This is standard HTML and has very good support, like `document.getElementsByClassName("ocr_line")`.
It would be easier to map between formats (ALTO) and serializations if the OCR application profile of the HTML were uniform, i.e. if you didn't force a naming convention on `class`, `id` or `title`.
Reusing the `title=` attribute of HTML elements for OCR-specific values is bad practice. It's understandable, since at the time of hOCR's initial development there were few mechanisms to extend HTML, but in HTML5 there are quite a few.

In a (possible) next major revision of the standard, we could use `data-ocr-*` attributes for that purpose: a value currently stored in `title=` could be expressed as a `data-ocr-*` attribute.
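As a rough illustration of that mapping, the sketch below splits an hOCR `title` value into one attribute per property. The `data-ocr-<property>` naming is an assumption made for illustration, since the proposal does not fix attribute names. Both function names are hypothetical, and quoted values (for example file names containing semicolons) are not handled.

```javascript
// Sketch only: the data-ocr-<property> naming is an assumption, not part of hOCR.
// Splits a title value such as "bbox 10 20 300 60; baseline 0.015 -18"
// into an object keyed by hypothetical data-ocr-* attribute names.
function titleToDataAttrs(title) {
  const attrs = {};
  for (const part of title.split(";")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // skip empty segments, e.g. after a trailing ";"
    const [name, ...values] = trimmed.split(/\s+/);
    attrs[`data-ocr-${name}`] = values.join(" ");
  }
  return attrs;
}

// Serializes the attribute object as it would appear inside an HTML tag.
function toAttrString(attrs) {
  return Object.entries(attrs)
    .map(([key, value]) => `${key}="${value}"`)
    .join(" ");
}

const attrs = titleToDataAttrs("bbox 10 20 300 60; baseline 0.015 -18");
console.log(toAttrString(attrs));
// data-ocr-bbox="10 20 300 60" data-ocr-baseline="0.015 -18"
```

Each property ends up in its own attribute, so a consumer no longer has to parse the semicolon-separated `title` string.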
The `data-ocr-*` form is more verbose, but it would make it much easier to specify behavior and work with the content, e.g. in JavaScript you could do:
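A minimal sketch of that kind of access, assuming a hypothetical `data-ocr-bbox` attribute that holds four space-separated numbers (the proposal does not fix attribute names or value formats). `parseBbox` and `lineBoxes` are illustrative names, and `root` stands for `document` or any element in a browser.

```javascript
// Assumption: data-ocr-bbox="x0 y0 x1 y1" (hypothetical attribute and format).
// Parses the attribute value into a box with numeric coordinates.
function parseBbox(value) {
  const [x0, y0, x1, y1] = value.trim().split(/\s+/).map(Number);
  return { x0, y0, x1, y1 };
}

// Collects the bounding boxes of all line elements under `root`.
// `root` is anything that provides querySelectorAll, e.g. `document`.
function lineBoxes(root) {
  const lines = root.querySelectorAll(".ocr_line[data-ocr-bbox]");
  // data-ocr-bbox is exposed through the standard dataset API as dataset.ocrBbox.
  return Array.from(lines, (el) => parseBbox(el.dataset.ocrBbox));
}
```

In a browser, `lineBoxes(document)` would return one box per matching line element. The existing `ocr_line` class is still used for selection, and only the values move out of `title`.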